
Pandas: efficient way to combine dataframes

I'm looking for a more efficient way than pd.concat to combine two pandas DataFrames.

I have a large DataFrame (~7 GB) with the following columns: "A", "B", "C", "D". I want to group the frame by "A", then, within each group, group by "B", average "C", and sum "D", and finally combine all the results into one DataFrame.
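For concreteness, here is a toy version of the input and the desired output (made-up values, just to illustrate the aggregation):

import pandas as pd

in_df = pd.DataFrame({
    "A": ["x", "x", "y"],
    "B": ["b1", "b1", "b2"],
    "C": [1.0, 3.0, 5.0],
    "D": [10, 20, 30],
})
# Desired result - one row per ("A", "B") pair,
# holding the mean of "C" and the sum of "D":
#
#    A   B    C   D
# 0  x  b1  2.0  30
# 1  y  b2  5.0  30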

I've tried the following approaches:

1) Creating an empty final DataFrame, iterating over the groupby of "A", doing the processing I need, and then pd.concat-ing each group onto the final DataFrame. The problem is that pd.concat is extremely slow (each call copies all the rows accumulated so far, so the cost of the loop grows quadratically).

2) Iterating over the groupby of "A", doing the processing I need, and then appending the result to a CSV file. That works OK, but I want to find out whether there is a more efficient way that doesn't involve all the I/O of writing to disk.

Code examples

First approach - final DataFrame with pd.concat:

import pandas as pd
import numpy as np

def pivot_frame(in_df_path):
    in_df = pd.read_csv(in_df_path, delimiter=DELIMITER)
    res_cols = in_df.columns.tolist()
    res = pd.DataFrame(columns=res_cols)
    g = in_df.groupby(by=["A"])
    for title, group in g:
        # Within each "A" group: group by "B", average "C", sum "D".
        temp = group.groupby(by=["B"]).agg({"C": np.mean, "D": np.sum})
        temp = temp.reset_index()
        temp.insert(0, "A", title)
        # Concatenating inside the loop re-copies `res` on every iteration.
        res = pd.concat([res, temp], ignore_index=True)
    return res

Second approach - writing to disk:

import csv
import pandas as pd
import numpy as np

def pivot_frame(in_df_path, output_path):
    in_df = pd.read_csv(in_df_path, delimiter=DELIMITER)
    with open(output_path, 'w') as f:
        csv_writer = csv.writer(f, delimiter=DELIMITER)
        csv_writer.writerow(["A", "B", "C", "D"])
        g = in_df.groupby(by=["A"])
        for title, group in g:
            # Within each "A" group: group by "B", average "C", sum "D".
            temp = group.groupby(by=["B"]).agg({"C": np.mean, "D": np.sum})
            temp = temp.reset_index()
            temp.insert(0, "A", title)
            # Append this group's rows to the already-open handle;
            # index=False keeps the rows aligned with the 4-column header.
            temp.to_csv(f, header=False, index=False, sep=DELIMITER)

The second approach works way faster than the first one, but I'm looking for something that would spare me the constant disk access. I read about split-apply-combine (e.g. https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html ) but I haven't found it helpful.

Thanks a lot! :)

Solved

So Niels Henkens' comment really helped, and the solution is to just:

result = in_df.groupby(by=["A","B"]).agg({"C": np.mean, "D": np.sum})
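
Since the result is indexed by the ("A", "B") MultiIndex, resetting the index afterwards gives back a flat frame like the original, if that's wanted:

result = result.reset_index()  # "A" and "B" become regular columns again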

Another improvement in performance is to use Dask -

import dask.dataframe as dd
import numpy as np

# Dask reads the CSV in partitions and runs the aggregation in parallel,
# only materialising the final result on .compute().
df = dd.read_csv(PATH_TO_FILE, delimiter=DELIMITER)
g = df.groupby(by=["A", "B"]).agg({"C": np.mean, "D": np.sum}).compute().reset_index()
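
After .compute(), g is a regular pandas DataFrame, so it can be written out in one shot if a file is still needed (OUTPUT_PATH is a placeholder here, not from the original code):

g.to_csv(OUTPUT_PATH, sep=DELIMITER, index=False)  # OUTPUT_PATH is hypothetical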
