Pandas: sum if column values coincide

I'm trying to do an apparently simple operation in Python:

I have several datasets, say 6, and I want to sum the values of one column wherever the values of the other two columns coincide. I then want to divide the summed column by the number of datasets, in this case 6 (i.e. compute the arithmetic mean). If a combination of values is missing from a dataset, it should contribute 0 to the sum.

Here are two data frames as an example:

   Code1  Code2  Distance
0   15.0   15.0         2
1   15.0   60.0         3
2   15.0   69.0         2
3   15.0  434.0         1
4   15.0  842.0         0

   Code1  Code2  Distance
0   14.0   15.0         4
1   14.0   60.0         7
2   15.0   15.0         0
3   15.0   60.0         1
4   15.0   69.0         9

The first column is the df.index. I want to sum the 'Distance' column only where 'Code1' and 'Code2' coincide. In this case the desired output would be something like:

   Code1  Code2  Distance
0   14.0   15.0       2
1   14.0   60.0       3.5
2   15.0   15.0       1
3   15.0   60.0       2
4   15.0   69.0       5.5
5   15.0  434.0       0.5
6   15.0  842.0       0

I've tried to do this with conditionals, but it gets really hard with more than two data frames. Is there a faster way to do this in Pandas?

Any help would be appreciated :-)

You could put all your data frames in a list and then use reduce to either append or merge them all. See the documentation for reduce (functools.reduce in Python 3).
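As a minimal sketch of what reduce does with a list of data frames: it folds the list left to right, combining two frames at a time. The three tiny frames below are made-up values for illustration only, and `pd.concat` is used for the pairwise step.

```python
from functools import reduce

import pandas as pd

# Three tiny illustrative frames (made-up values, not from the question)
frames = [
    pd.DataFrame({"Code1": [15.0], "Code2": [15.0], "Distance": [2]}),
    pd.DataFrame({"Code1": [15.0], "Code2": [60.0], "Distance": [3]}),
    pd.DataFrame({"Code1": [14.0], "Code2": [15.0], "Distance": [4]}),
]

# reduce folds the list left to right: f(f(frames[0], frames[1]), frames[2])
stacked = reduce(
    lambda left, right: pd.concat([left, right], ignore_index=True),
    frames,
)

# For plain stacking, a single concat call over the whole list gives the same frame
assert stacked.equals(pd.concat(frames, ignore_index=True))
print(stacked)
```

For simple stacking the single `pd.concat(frames)` call is usually enough. reduce becomes more useful when the pairwise step does real work, as in the merge method.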

First, some functions are defined below to generate sample data.

from functools import reduce  # built in on Python 2, must be imported on Python 3

import numpy as np
import pandas

# GENERATE DATA
# Code 1 between 13 and 15
def generate_code_1(n):
    return np.floor(np.random.rand(n,1) * 3 + 13)

# Code 2 between 1 and 1000
def generate_code_2(n):
    return np.floor(np.random.rand(n,1) * 1000) + 1

# Distance between 0 and 9
def generate_distance(n):
    return np.floor(np.random.rand(n,1) * 10)

# Generate a data frame as hstack of 3 arrays
def generate_data_frame(n):
    data = np.hstack([
         generate_code_1(n)
        ,generate_code_2(n)
        ,generate_distance(n)
    ])
    df = pandas.DataFrame(data=data, columns=['Code 1', 'Code 2', 'Distance'])
    # Remove possible duplicates of (Code 1, Code 2), keeping the smallest distance.
    # Duplicates would break the merge method, but not the append method.
    df = df.groupby(['Code 1', 'Code 2'], as_index=False).min()
    return df

# Generate n data frames each with m rows in a list
def generate_data_frames(n, m, with_count=False):
    df_list = []
    for k in range(0, n):
        df = generate_data_frame(m)
        # Add count column, needed for merge method to keep track of how many cases we have seen
        if with_count:
            df['Count'] = 1
        df_list.append(df)
    return df_list

Append method (faster, shorter, nicer)

df_list = generate_data_frames(94, 5)

# Stack all data frames together using reduce
# (DataFrame.append was removed in pandas 2.0; pandas.concat is the replacement)
df_append = reduce(lambda df_1, df_2: pandas.concat([df_1, df_2]), df_list)

# Aggregate by Code 1 and Code 2
df_append_grouped = df_append.groupby(['Code 1', 'Code 2'], as_index=False)
df_append_result = df_append_grouped.mean()
df_append_result
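One caveat: the grouped mean divides each sum by the number of frames in which a pair actually appears. The question's desired output instead divides by the total number of frames, so a missing pair counts as 0. Here is a minimal sketch of that variant, using the two example frames from the question (with its `Code1`/`Code2` column names rather than the `Code 1`/`Code 2` used in the generated data):

```python
import pandas as pd

# The two example frames from the question
df_a = pd.DataFrame({
    "Code1": [15.0, 15.0, 15.0, 15.0, 15.0],
    "Code2": [15.0, 60.0, 69.0, 434.0, 842.0],
    "Distance": [2, 3, 2, 1, 0],
})
df_b = pd.DataFrame({
    "Code1": [14.0, 14.0, 15.0, 15.0, 15.0],
    "Code2": [15.0, 60.0, 15.0, 60.0, 69.0],
    "Distance": [4, 7, 0, 1, 9],
})
frames = [df_a, df_b]

stacked = pd.concat(frames, ignore_index=True)

# Sum per (Code1, Code2) pair, then divide by the total number of frames,
# so a pair that is missing from a frame contributes 0 to the average
result = stacked.groupby(["Code1", "Code2"], as_index=False)["Distance"].sum()
result["Distance"] = result["Distance"] / len(frames)
print(result)
```

On these two frames the Distance column comes out as 2, 3.5, 1, 2, 5.5, 0.5, 0, which matches the desired output in the question. This assumes each (Code1, Code2) pair appears at most once per frame.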

Merge method

df_list = generate_data_frames(94, 5, with_count=True)

# Function to be passed to reduce. Merge 2 data frames and update Distance and Count
def merge_dfs(df_1, df_2):
    df = pandas.merge(df_1, df_2, on=['Code 1', 'Code 2'], how='outer', suffixes=('', '_y'))
    df = df.fillna(0)
    df['Distance'] = df['Distance'] + df['Distance_y']
    df['Count'] = df['Count'] + df['Count_y']
    del df['Distance_y']
    del df['Count_y']
    return df

# Use reduce to apply merge over the list of data frames
df_merge_result = reduce(merge_dfs, df_list)

# Replace distance with its mean and drop Count
df_merge_result['Distance'] = df_merge_result['Distance'] / df_merge_result['Count']
del df_merge_result['Count']
df_merge_result
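To see what a single merge step does, here is a small sketch on a subset of the question's rows (using its `Code1`/`Code2` column names, with a `Count` column added by hand). The outer merge keeps pairs found in either frame, and `fillna(0)` turns the missing side into a zero Distance and a zero Count. The final sort is only there to make the row order predictable.

```python
import pandas as pd

# A few rows from the question's two frames, each tagged with Count = 1
df_a = pd.DataFrame({
    "Code1": [15.0, 15.0],
    "Code2": [15.0, 434.0],
    "Distance": [2, 1],
    "Count": [1, 1],
})
df_b = pd.DataFrame({
    "Code1": [14.0, 15.0],
    "Code2": [15.0, 15.0],
    "Distance": [4, 0],
    "Count": [1, 1],
})

# Outer merge: pairs present in only one frame get NaN on the other side
merged = pd.merge(df_a, df_b, on=["Code1", "Code2"], how="outer", suffixes=("", "_y"))
merged = merged.fillna(0)

# Accumulate the running sum and the number of frames that contained each pair
merged["Distance"] = merged["Distance"] + merged["Distance_y"]
merged["Count"] = merged["Count"] + merged["Count_y"]
merged = merged.drop(columns=["Distance_y", "Count_y"])
merged = merged.sort_values(["Code1", "Code2"], ignore_index=True)
print(merged)
```

Only the pair (15.0, 15.0) appears in both frames, so it ends with Count 2 while the other pairs keep Count 1. Dividing Distance by Count therefore averages over the frames that contain the pair, not over all frames.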

