Pandas: sum if column values coincide
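I am trying to do a simple operation in Python:
I have several data sets (say 6), and I want to sum the values of one column wherever the values of two other columns coincide. After that, I want to divide the summed values by the number of data sets I have, 6 in this case (i.e. compute the arithmetic mean). If a combination of the other two columns does not appear in a data set, I want that data set to contribute 0 to the sum.
Here are two example data frames: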
   Code1  Code2  Distance
0   15.0   15.0         2
1   15.0   60.0         3
2   15.0   69.0         2
3   15.0  434.0         1
4   15.0  842.0         0

   Code1  Code2  Distance
0   14.0   15.0         4
1   14.0   60.0         7
2   15.0   15.0         0
3   15.0   60.0         1
4   15.0   69.0         9
The first column is the df.index. I then want to sum the "Distance" column only where the "Code1" and "Code2" columns coincide. In this case, the desired output would look like:
   Code1  Code2  Distance
0   14.0   15.0       2
1   14.0   60.0       3.5
2   15.0   15.0       1
3   15.0   60.0       2
4   15.0   69.0       5.5
5   15.0  434.0       0.5
6   15.0  842.0       0
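I tried doing this with conditionals, but that gets really hard for more than two data frames. Is there a faster way to do this in pandas?
Any help would be appreciated :-)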
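You can put all of your data frames in a list and then use reduce to either append or merge them. See the documentation for reduce (functools.reduce in Python 3).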
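To make the idea concrete, here is a minimal sketch of how reduce folds a list of data frames pairwise into one. The three tiny frames are made up for illustration, and pd.concat is used as the combining step as an assumption about what works on current pandas versions.

```python
from functools import reduce

import pandas as pd

# Three small, made-up frames sharing the same columns (illustrative only).
frames = [
    pd.DataFrame({"Code1": [15.0], "Code2": [15.0], "Distance": [2]}),
    pd.DataFrame({"Code1": [15.0], "Code2": [60.0], "Distance": [3]}),
    pd.DataFrame({"Code1": [14.0], "Code2": [15.0], "Distance": [4]}),
]

# reduce applies the function to the running result and the next item,
# i.e. combine(combine(frames[0], frames[1]), frames[2]).
stacked = reduce(
    lambda left, right: pd.concat([left, right], ignore_index=True),
    frames,
)

print(stacked)
```

The same pattern works with any function that takes two data frames and returns one, which is why it can be used for both stacking and merging.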
First, some functions for generating sample data are defined below.
from functools import reduce  # a built-in in Python 2; lives in functools in Python 3

import numpy as np
import pandas
# GENERATE DATA

# Code 1 between 13 and 15
def generate_code_1(n):
    return np.floor(np.random.rand(n, 1) * 3 + 13)

# Code 2 between 1 and 1000
def generate_code_2(n):
    return np.floor(np.random.rand(n, 1) * 1000) + 1

# Distance between 0 and 9
def generate_distance(n):
    return np.floor(np.random.rand(n, 1) * 10)

# Generate a data frame as hstack of 3 arrays
def generate_data_frame(n):
    data = np.hstack([
        generate_code_1(n),
        generate_code_2(n),
        generate_distance(n),
    ])
    df = pandas.DataFrame(data=data, columns=['Code 1', 'Code 2', 'Distance'])
    # Remove possible duplicates of (Code 1, Code 2), keeping the smallest distance.
    # Duplicates would break the merge method but would not break the append method.
    df = df.groupby(['Code 1', 'Code 2'], as_index=False).min()
    return df

# Generate a list of n data frames, each built from m random rows
def generate_data_frames(n, m, with_count=False):
    df_list = []
    for k in range(0, n):
        df = generate_data_frame(m)
        # Add a Count column, needed by the merge method to keep track of
        # how many data frames contained each (Code 1, Code 2) pair
        if with_count:
            df['Count'] = 1
        df_list.append(df)
    return df_list
Append method (faster, shorter, nicer)
df_list = generate_data_frames(94, 5)
# Stack all data frames together using reduce.
# DataFrame.append was removed in pandas 2.0, so pandas.concat does the appending here.
df_append = reduce(lambda df_1, df_2: pandas.concat([df_1, df_2]), df_list)
# Aggregate by Code 1 and Code 2
df_append_grouped = df_append.groupby(['Code 1', 'Code 2'], as_index=False)
df_append_result = df_append_grouped.mean()
df_append_result
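Note that aggregating with the mean averages each (Code 1, Code 2) pair only over the data frames in which it actually occurs. The question instead asks to divide by the total number of data sets, so that a missing pair counts as 0. A minimal sketch of that variant, using the two example frames from the question and their column names (Code1, Code2), might look like this; stacking with pd.concat is an assumption on my part.

```python
import pandas as pd

# The two example frames from the question.
df_a = pd.DataFrame({
    "Code1": [15.0, 15.0, 15.0, 15.0, 15.0],
    "Code2": [15.0, 60.0, 69.0, 434.0, 842.0],
    "Distance": [2, 3, 2, 1, 0],
})
df_b = pd.DataFrame({
    "Code1": [14.0, 14.0, 15.0, 15.0, 15.0],
    "Code2": [15.0, 60.0, 15.0, 60.0, 69.0],
    "Distance": [4, 7, 0, 1, 9],
})
df_list = [df_a, df_b]

# Stack all frames, sum Distance per (Code1, Code2) pair, then divide by the
# number of frames so that a pair missing from a frame contributes 0.
stacked = pd.concat(df_list, ignore_index=True)
result = stacked.groupby(["Code1", "Code2"], as_index=False)["Distance"].sum()
result["Distance"] = result["Distance"] / len(df_list)

print(result)
```

On these two frames, the Distance column comes out as 2, 3.5, 1, 2, 5.5, 0.5, 0, which is the output the question asks for.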
Merge method
df_list = generate_data_frames(94, 5, with_count=True)

# Function to be passed to reduce. Merge 2 data frames and update Distance and Count
def merge_dfs(df_1, df_2):
    df = pandas.merge(df_1, df_2, on=['Code 1', 'Code 2'], how='outer', suffixes=('', '_y'))
    # Pairs missing from one side get Distance 0 and Count 0
    df = df.fillna(0)
    df['Distance'] = df['Distance'] + df['Distance_y']
    df['Count'] = df['Count'] + df['Count_y']
    del df['Distance_y']
    del df['Count_y']
    return df

# Use reduce to apply merge over the list of data frames
df_merge_result = reduce(merge_dfs, df_list)
# Replace distance with its mean and drop Count
df_merge_result['Distance'] = df_merge_result['Distance'] / df_merge_result['Count']
del df_merge_result['Count']
df_merge_result
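As a sanity check, the two approaches should give the same per-pair means. The sketch below compares them on three small made-up frames; the frames, the KEYS constant, and the merge_pair helper are hypothetical names introduced only for this illustration, and stacking is done with pd.concat.

```python
from functools import reduce

import pandas as pd

KEYS = ["Code 1", "Code 2"]

# Small made-up frames; each (Code 1, Code 2) pair is unique within a frame.
frames = [
    pd.DataFrame({"Code 1": [13.0, 14.0], "Code 2": [5.0, 7.0], "Distance": [2.0, 4.0]}),
    pd.DataFrame({"Code 1": [13.0, 15.0], "Code 2": [5.0, 9.0], "Distance": [6.0, 1.0]}),
    pd.DataFrame({"Code 1": [13.0, 14.0], "Code 2": [5.0, 7.0], "Distance": [1.0, 0.0]}),
]

# Stacking approach: concatenate all frames, then average per key pair.
by_concat = (
    pd.concat(frames, ignore_index=True)
    .groupby(KEYS, as_index=False)["Distance"]
    .mean()
)


def merge_pair(left, right):
    """Outer-merge two frames and accumulate Distance and Count."""
    merged = pd.merge(left, right, on=KEYS, how="outer", suffixes=("", "_y")).fillna(0)
    merged["Distance"] = merged["Distance"] + merged["Distance_y"]
    merged["Count"] = merged["Count"] + merged["Count_y"]
    return merged.drop(columns=["Distance_y", "Count_y"])


# Merge approach: count how many frames contained each pair, then divide.
counted = [frame.assign(Count=1) for frame in frames]
by_merge = reduce(merge_pair, counted)
by_merge["Distance"] = by_merge["Distance"] / by_merge["Count"]
by_merge = by_merge.drop(columns="Count").sort_values(KEYS).reset_index(drop=True)

# Raises an AssertionError if the two results differ.
pd.testing.assert_frame_equal(by_concat, by_merge)
print(by_merge)
```

The merge result is sorted by the key columns before comparing, since the row order after an outer merge is not something the comparison should depend on.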