Pandas: sum if column values coincide

I'm trying to do an apparently simple operation in Python:

I have some datasets, say 6, and I want to sum the values of one column wherever the values of the other two columns coincide. After that, I want to divide the summed values by the number of datasets I have, in this case 6 (i.e. calculate the arithmetic mean). If a combination of the other two columns is missing from a dataset, it should contribute 0 to the sum.

Here are two data frames as an example:

   Code1  Code2  Distance
0   15.0   15.0         2
1   15.0   60.0         3
2   15.0   69.0         2
3   15.0  434.0         1
4   15.0  842.0         0

   Code1  Code2  Distance
0   14.0   15.0         4
1   14.0   60.0         7
2   15.0   15.0         0
3   15.0   60.0         1
4   15.0   69.0         9

The first column is the df.index column. I want to sum the 'Distance' column only where both 'Code1' and 'Code2' coincide, then divide by the number of data frames (2 here). In this case the desired output would be something like:

   Code1  Code2  Distance
0   14.0   15.0       2
1   14.0   60.0       3.5
2   15.0   15.0       1
3   15.0   60.0       2
4   15.0   69.0       5.5
5   15.0  434.0       0.5
6   15.0  842.0       0

I've tried to do this using conditionals, but with more than two data frames it gets really hard. Is there a method in Pandas that does this faster?

Any help would be appreciated :-)

You could put all your data frames in a list and then use reduce to either append or merge them all. In Python 3, reduce lives in the functools module.

First, the functions below generate sample data.

from functools import reduce  # reduce is not a builtin in Python 3

import pandas
import numpy as np

# GENERATE DATA
# Code 1 between 13 and 15
def generate_code_1(n):
    return np.floor(np.random.rand(n,1) * 3 + 13)

# Code 2 between 1 and 1000
def generate_code_2(n):
    return np.floor(np.random.rand(n,1) * 1000) + 1

# Distance between 0 and 9
def generate_distance(n):
    return np.floor(np.random.rand(n,1) * 10)

# Generate a data frame as hstack of 3 arrays
def generate_data_frame(n):
    data = np.hstack([
         generate_code_1(n)
        ,generate_code_2(n)
        ,generate_distance(n)
    ])
    df = pandas.DataFrame(data=data, columns=['Code 1', 'Code 2', 'Distance'])
    # Remove duplicate (Code 1, Code 2) pairs, keeping the smallest distance.
    # Duplicates would break the merge method, though not the append method.
    df = df.groupby(['Code 1', 'Code 2'], as_index=False)
    df = df.aggregate('min')
    return df

# Generate n data frames each with m rows in a list
def generate_data_frames(n, m, with_count=False):
    df_list = []
    for k in range(0, n):
        df = generate_data_frame(m)
        # Add count column, needed for merge method to keep track of how many cases we have seen
        if with_count:
            df['Count'] = 1
        df_list.append(df)
    return df_list

Append method (faster, shorter, nicer)

df_list = generate_data_frames(94, 5)

# Stack all data frames into one (DataFrame.append was removed in pandas 2.0,
# so pandas.concat is used here instead of reduce + append)
df_append = pandas.concat(df_list, ignore_index=True)

# Aggregate by Code 1 and Code 2
df_append_grouped = df_append.groupby(['Code 1', 'Code 2'], as_index=False)
df_append_result = df_append_grouped.aggregate('mean')
df_append_result
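
Note that a group mean averages only over the data frames in which a (Code 1, Code 2) pair actually occurs, whereas the question asks to divide by the total number of data sets, counting a missing pair as 0. The following is a minimal sketch of that variant, using the two example frames from the question (column names Code1 and Code2 as written there); the variable names are illustrative only.

```python
import pandas as pd

# The two example frames from the question
df_a = pd.DataFrame({
    "Code1": [15.0, 15.0, 15.0, 15.0, 15.0],
    "Code2": [15.0, 60.0, 69.0, 434.0, 842.0],
    "Distance": [2, 3, 2, 1, 0],
})
df_b = pd.DataFrame({
    "Code1": [14.0, 14.0, 15.0, 15.0, 15.0],
    "Code2": [15.0, 60.0, 15.0, 60.0, 69.0],
    "Distance": [4, 7, 0, 1, 9],
})
frames = [df_a, df_b]

# Stack all rows, sum Distance per (Code1, Code2) pair, then divide by the
# number of data sets so that a pair missing from a frame counts as 0.
stacked = pd.concat(frames, ignore_index=True)
result = stacked.groupby(["Code1", "Code2"], as_index=False)["Distance"].sum()
result["Distance"] = result["Distance"] / len(frames)
print(result)
```

For these two frames the Distance column comes out as 2, 3.5, 1, 2, 5.5, 0.5, 0, which matches the desired output in the question.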

Merge method

df_list = generate_data_frames(94, 5, with_count=True)

# Function to be passed to reduce. Merge 2 data frames and update Distance and Count
def merge_dfs(df_1, df_2):
    df = pandas.merge(df_1, df_2, on=['Code 1', 'Code 2'], how='outer', suffixes=('', '_y'))
    df = df.fillna(0)
    df['Distance'] = df['Distance'] + df['Distance_y']
    df['Count'] = df['Count'] + df['Count_y']
    del df['Distance_y']
    del df['Count_y']
    return df

# Use reduce to apply merge over the list of data frames
df_merge_result = reduce(merge_dfs, df_list)

# Replace distance with its mean and drop Count
df_merge_result['Distance'] = df_merge_result['Distance'] / df_merge_result['Count']
del df_merge_result['Count']
df_merge_result
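
If the goal is the question's "sum, then divide by the number of data sets" rather than a per-pair count, the fold can probably be written more compactly by indexing each frame on the code pair and adding with fill_value=0. This is only a sketch under the assumption that each pair occurs at most once per frame; it uses the question's example data and column names, and the variable names are made up for illustration.

```python
from functools import reduce

import pandas as pd

# The two example frames from the question, as (Code1, Code2, Distance) rows
rows_a = [(15.0, 15.0, 2), (15.0, 60.0, 3), (15.0, 69.0, 2),
          (15.0, 434.0, 1), (15.0, 842.0, 0)]
rows_b = [(14.0, 15.0, 4), (14.0, 60.0, 7), (15.0, 15.0, 0),
          (15.0, 60.0, 1), (15.0, 69.0, 9)]
columns = ["Code1", "Code2", "Distance"]
frames = [pd.DataFrame(rows, columns=columns) for rows in (rows_a, rows_b)]

# Index each frame by the code pair so that addition aligns on it.
# Assumes each pair occurs at most once per frame.
series_list = [df.set_index(["Code1", "Code2"])["Distance"] for df in frames]

# Fold the list: a pair missing from one side is treated as 0.
total = reduce(lambda left, right: left.add(right, fill_value=0), series_list)

# Divide by the number of data sets and restore the code columns.
df_mean = (total / len(series_list)).reset_index()
print(df_mean)
```

Compared with the merge version, no Count column or suffix cleanup is needed, because the divisor is simply the length of the list.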
