Pandas: sum if columns values coincide

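I am trying to do a simple operation in Python:

I have several data sets, say 6, and I want to sum the values of one column wherever the values of two other columns coincide. Afterwards, I want to divide the summed values by the number of data sets I have, 6 in this case (i.e. compute the arithmetic mean). Where the values of the other columns do not coincide, I want to add 0.

Here are two example data frames: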

   Code1  Code2  Distance
0   15.0   15.0         2
1   15.0   60.0         3
2   15.0   69.0         2
3   15.0  434.0         1
4   15.0  842.0         0

   Code1  Code2  Distance
0   14.0   15.0         4
1   14.0   60.0         7
2   15.0   15.0         0
3   15.0   60.0         1
4   15.0   69.0         9

The first column is the df.index. I want to sum the "Distance" column only where the "Code1" and "Code2" columns coincide. In this case the desired output would be something like:

   Code1  Code2  Distance
0   14.0   15.0       2.0
1   14.0   60.0       3.5
2   15.0   15.0       1.0
3   15.0   60.0       2.0
4   15.0   69.0       5.5
5   15.0  434.0       0.5
6   15.0  842.0       0.0

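I tried doing this with conditionals, but it gets really hard for more than two data frames. Is there a faster way to do this in pandas?

Any help would be appreciated :-)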

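You can put all the data frames in a list and then use reduce to either append or merge them. Have a look at reduce (functools.reduce in Python 3).

First, here are some functions that generate the sample data.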
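For readers who have not used reduce before, here is a minimal sketch of how it folds a list of data frames into one. The frames and the column names key and value are made up for illustration only, and pandas.concat stands in for the appending step.

```python
from functools import reduce

import pandas as pd

# Three tiny, made-up frames sharing the same columns (illustrative data only).
frames = [
    pd.DataFrame({"key": ["a", "b"], "value": [1, 2]}),
    pd.DataFrame({"key": ["a", "c"], "value": [3, 4]}),
    pd.DataFrame({"key": ["b", "c"], "value": [5, 6]}),
]

# reduce(f, [x0, x1, x2]) evaluates f(f(x0, x1), x2),
# so the frames are stacked one after another, in list order.
stacked = reduce(
    lambda left, right: pd.concat([left, right], ignore_index=True),
    frames,
)

print(stacked)
```

The same pattern works with any function that takes two data frames and returns one, which is what both methods below rely on.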

import pandas
import numpy as np

# GENERATE DATA
# Code 1 between 13 and 15
def generate_code_1(n):
    return np.floor(np.random.rand(n,1) * 3 + 13)

# Code 2 between 1 and 1000
def generate_code_2(n):
    return np.floor(np.random.rand(n,1) * 1000) + 1

# Distance between 0 and 9
def generate_distance(n):
    return np.floor(np.random.rand(n,1) * 10)

# Generate a data frame as hstack of 3 arrays
def generate_data_frame(n):
    data = np.hstack([
         generate_code_1(n)
        ,generate_code_2(n)
        ,generate_distance(n)
    ])
    df = pandas.DataFrame(data=data, columns=['Code 1', 'Code 2', 'Distance'])
    # Remove possible duplicates of (Code 1, Code 2), keeping the smallest distance.
    # Duplicates would break the merge method; the append method tolerates them.
    df = df.groupby(['Code 1', 'Code 2'], as_index=False)
    # The string name 'min' avoids the warning newer pandas versions raise for np.min here
    df = df.aggregate('min')
    return df

# Generate n data frames each with m rows in a list
def generate_data_frames(n, m, with_count=False):
    df_list = []
    for k in range(0, n):
        df = generate_data_frame(m)
        # Add count column, needed for merge method to keep track of how many cases we have seen
        if with_count:
            df['Count'] = 1
        df_list.append(df)
    return df_list

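Append method (faster, shorter, nicer)

from functools import reduce  # reduce is not a builtin in Python 3

df_list = generate_data_frames(94, 5)

# Append all data frames together using reduce.
# DataFrame.append was removed in pandas 2.0, so pandas.concat does the appending.
df_append = reduce(lambda df_1, df_2: pandas.concat([df_1, df_2]), df_list)

# Aggregate by Code 1 and Code 2
df_append_grouped = df_append.groupby(['Code 1', 'Code 2'], as_index=False)
df_append_result = df_append_grouped.aggregate('mean')
df_append_result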
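Note that aggregating with the mean averages only over the frames in which a (Code 1, Code 2) pair actually occurs. The expected output in the question instead treats a missing pair as 0 and divides by the total number of data frames. A minimal sketch of that variant, assuming the two sample frames from the question (with its column names Code1 and Code2) and using a sum followed by a division:

```python
import pandas as pd

# The two sample frames from the question.
df_a = pd.DataFrame({
    "Code1": [15.0, 15.0, 15.0, 15.0, 15.0],
    "Code2": [15.0, 60.0, 69.0, 434.0, 842.0],
    "Distance": [2, 3, 2, 1, 0],
})
df_b = pd.DataFrame({
    "Code1": [14.0, 14.0, 15.0, 15.0, 15.0],
    "Code2": [15.0, 60.0, 15.0, 60.0, 69.0],
    "Distance": [4, 7, 0, 1, 9],
})
frames = [df_a, df_b]

# Stack all frames, sum Distance per (Code1, Code2) pair,
# then divide by the number of frames so that a missing pair counts as 0.
stacked = pd.concat(frames, ignore_index=True)
result = stacked.groupby(["Code1", "Code2"], as_index=False)["Distance"].sum()
result["Distance"] = result["Distance"] / len(frames)

print(result)
```

This gives the seven rows listed as the desired output in the question, for example 3.5 for the pair (14.0, 60.0), which appears in only one of the two frames.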

Merge method

from functools import reduce  # reduce is not a builtin in Python 3

df_list = generate_data_frames(94, 5, with_count=True)

# Function to be passed to reduce. Merge 2 data frames and update Distance and Count
def merge_dfs(df_1, df_2):
    df = pandas.merge(df_1, df_2, on=['Code 1', 'Code 2'], how='outer', suffixes=('', '_y'))
    df = df.fillna(0)
    df['Distance'] = df['Distance'] + df['Distance_y']
    df['Count'] = df['Count'] + df['Count_y']
    del df['Distance_y']
    del df['Count_y']
    return df

# Use reduce to apply merge over the list of data frames
df_merge_result = reduce(merge_dfs, df_list)

# Replace distance with its mean and drop Count
df_merge_result['Distance'] = df_merge_result['Distance'] / df_merge_result['Count']
del df_merge_result['Count']
df_merge_result
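As a sanity check, the two methods can be compared on the same input. The sketch below is an illustration under assumptions, not part of the original answer: it reuses the two sample frames from the question (with its column names Code1 and Code2), and merge_pair is a hypothetical compact rewrite of merge_dfs.

```python
from functools import reduce

import pandas as pd

keys = ["Code1", "Code2"]

# The two sample frames from the question.
frames = [
    pd.DataFrame({
        "Code1": [15.0, 15.0, 15.0, 15.0, 15.0],
        "Code2": [15.0, 60.0, 69.0, 434.0, 842.0],
        "Distance": [2, 3, 2, 1, 0],
    }),
    pd.DataFrame({
        "Code1": [14.0, 14.0, 15.0, 15.0, 15.0],
        "Code2": [15.0, 60.0, 15.0, 60.0, 69.0],
        "Distance": [4, 7, 0, 1, 9],
    }),
]


def merge_pair(left, right):
    # Outer merge keeps pairs present in either frame; absent values become 0.
    merged = pd.merge(left, right, on=keys, how="outer", suffixes=("", "_y"))
    merged = merged.fillna(0)
    merged["Distance"] = merged["Distance"] + merged["Distance_y"]
    merged["Count"] = merged["Count"] + merged["Count_y"]
    return merged.drop(columns=["Distance_y", "Count_y"])


# Merge method: track how often each pair was seen, then divide by that count.
counted = [df.assign(Count=1) for df in frames]
merged = reduce(merge_pair, counted)
merged["Distance"] = merged["Distance"] / merged["Count"]
by_merge = merged.drop(columns="Count").sort_values(keys).reset_index(drop=True)

# Append method: stack everything and take the mean per pair.
by_concat = pd.concat(frames).groupby(keys, as_index=False)["Distance"].mean()

# Raises an AssertionError if the two results differ.
pd.testing.assert_frame_equal(by_merge, by_concat)
print(by_merge)
```

Both results average over the frames in which a pair occurs. Dividing by len(frames) instead of Count would count missing pairs as 0.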

