
Pandas: apply a function to each row of a dataframe to return one or more new rows per entry

I have a dataset like the following:

import pandas as pd
df = pd.DataFrame([[[{'name': 'Joe', 'age': 32, 'category': 'A'}, {'name': 'Jane', 'age': 35, 'category': 'A'}]],
                   [[{'name': 'Beth', 'age': 33, 'category': 'B'}, {'name': 'Bob', 'age': 32, 'category': 'B'}]],
                   [[{'name': 'John', 'age': 35, 'category': 'C'}]],
                   [[{'name': 'Jill', 'age': 33, 'category': 'D'}]],
                   ], columns=['Entries'])

The dataframe has a single column (named 'Entries'), where each row contains a list of one or more dictionaries.

I need to convert the dataframe so that each key in the dictionaries becomes a column and each dictionary becomes its own row, with the values in the corresponding columns, like so:

    age category  name
0  32.0        A   Joe
1  35.0        A  Jane
2  33.0        B  Beth
3  32.0        B   Bob
4  35.0        C  John
5  33.0        D  Jill

Currently I have the following code to do this:

df2 = pd.DataFrame()
for idx, row in df.iterrows():
    for entry in row.Entries:
        name = entry['name']
        age = entry['age']
        category = entry['category']

        single_entry = pd.Series({'name': name, 'age': age, 'category': category})
        # DataFrame.append was removed in pandas 2.0, so this line needs pandas < 2.0
        df2 = df2.append(single_entry, ignore_index=True)

The code above works, but it is very slow on my actual dataset, which has over 1,000,000 rows.

I considered using built-in Pandas functions such as apply to gain speed, but I don't know how to apply them to this particular problem.

What is a more efficient way to achieve the above result?

I suggest using a list comprehension to flatten the values, which improves speed:

df = pd.DataFrame([y for x in df['Entries'] for y in x])

Another idea:

from itertools import chain

df = pd.DataFrame(chain.from_iterable(df['Entries'].tolist()))

print(df)
   name  age category
0   Joe   32        A
1  Jane   35        A
2  Beth   33        B
3   Bob   32        B
4  John   35        C
5  Jill   33        D
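
For reference, here is a self-contained sketch that runs both flattening approaches on the sample data from the question and checks that they agree. The names `flat_lc` and `flat_chain` are illustrative and not part of the original answer.

```python
from itertools import chain

import pandas as pd

# Sample data from the question: one column, each cell holds a list of dicts.
df = pd.DataFrame([[[{'name': 'Joe', 'age': 32, 'category': 'A'}, {'name': 'Jane', 'age': 35, 'category': 'A'}]],
                   [[{'name': 'Beth', 'age': 33, 'category': 'B'}, {'name': 'Bob', 'age': 32, 'category': 'B'}]],
                   [[{'name': 'John', 'age': 35, 'category': 'C'}]],
                   [[{'name': 'Jill', 'age': 33, 'category': 'D'}]],
                   ], columns=['Entries'])

# Approach 1: nested list comprehension that yields every dict in row order.
flat_lc = pd.DataFrame([entry for entries in df['Entries'] for entry in entries])

# Approach 2: chain.from_iterable lazily concatenates the per-row lists.
flat_chain = pd.DataFrame(chain.from_iterable(df['Entries'].tolist()))

# Both produce one row per dictionary with the dictionary keys as columns.
print(flat_lc.equals(flat_chain))
print(flat_lc)
```

Either way the DataFrame is built once from a flat list of dictionaries, instead of growing row by row as in the original loop.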

Performance with the sample data repeated 10,000 times (40k rows):

df = pd.concat([df] * 10000, ignore_index=True)

In [222]: %timeit pd.DataFrame([y for x in df['Entries'] for y in x])
66.1 ms ± 770 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [223]: %timeit pd.DataFrame(chain.from_iterable(df['Entries'].tolist()))
60.9 ms ± 691 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [224]: %timeit pd.DataFrame(itertools.chain(*df.Entries.tolist()))
60.8 ms ± 237 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [225]: %timeit pd.DataFrame(sum(df.Entries.tolist(),[]))
3.94 s ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [226]: %timeit pd.DataFrame(df['Entries'].explode().tolist())
131 ms ± 3.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
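
A likely explanation for the slow `sum(..., [])` timing is that each `+` builds a new list and copies everything accumulated so far, so the work grows roughly quadratically with the number of sublists, whereas `chain.from_iterable` visits each element once. The sketch below uses plain Python lists to show this. The helper names `flatten_sum` and `flatten_chain` and the data size are assumptions for illustration, and the timings will vary by machine.

```python
import timeit
from itertools import chain

# Hypothetical nested data: 4,000 short lists of small dicts.
nested = [[{'name': 'Joe', 'age': 32}],
          [{'name': 'Beth', 'age': 33}, {'name': 'Bob', 'age': 32}]] * 2000


def flatten_sum(lists):
    # Each addition creates a new list containing everything seen so far.
    return sum(lists, [])


def flatten_chain(lists):
    # Iterates over every element exactly once.
    return list(chain.from_iterable(lists))


# Both helpers return the same flat list.
assert flatten_sum(nested) == flatten_chain(nested)

t_sum = timeit.timeit(lambda: flatten_sum(nested), number=3)
t_chain = timeit.timeit(lambda: flatten_chain(nested), number=3)
print(f"sum(lists, []): {t_sum:.4f} s, chain.from_iterable: {t_chain:.4f} s")
```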

If you want to retain the original index of each record, explode helps:

s = df['Entries'].explode()
pd.DataFrame(s.tolist(), index=s.index)

   name  age category
0   Joe   32        A
0  Jane   35        A
1  Beth   33        B
1   Bob   32        B
2  John   35        C
3  Jill   33        D
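
If the source row is needed as a regular column rather than a repeated index, one option is to name the index and reset it. This is a sketch under assumptions: the column name `source_row` is made up for illustration, and the `dropna()` call is a precaution because `explode` turns empty lists into NaN, which cannot be expanded into columns.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame([[[{'name': 'Joe', 'age': 32, 'category': 'A'}, {'name': 'Jane', 'age': 35, 'category': 'A'}]],
                   [[{'name': 'Beth', 'age': 33, 'category': 'B'}, {'name': 'Bob', 'age': 32, 'category': 'B'}]],
                   [[{'name': 'John', 'age': 35, 'category': 'C'}]],
                   [[{'name': 'Jill', 'age': 33, 'category': 'D'}]],
                   ], columns=['Entries'])

# One dict per row; the index still points at the originating row.
# dropna() discards the NaN that explode would emit for an empty list.
s = df['Entries'].explode().dropna()

# Expand the dicts into columns, then move the index into a named column.
out = (pd.DataFrame(s.tolist(), index=s.index)
         .rename_axis('source_row')
         .reset_index())
print(out)
```

With the source row as a column, the flattened records can be grouped by, or merged back to, the rows they came from.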

IIUC:

pd.DataFrame(sum(df.Entries.tolist(),[]))

   name  age category
0   Joe   32        A
1  Jane   35        A
2  Beth   33        B
3   Bob   32        B
4  John   35        C
5  Jill   33        D

Or:

import itertools
pd.DataFrame(itertools.chain(*df.Entries.tolist()))
   name  age category
0   Joe   32        A
1  Jane   35        A
2  Beth   33        B
3   Bob   32        B
4  John   35        C
5  Jill   33        D
