How to sum a column over unique values in two other columns with Pandas?
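I need to find the sum of one column for each unique combination of values in two other columns of a DataFrame.
Sample code that mimics what I am trying to do: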
import numpy as np
import pandas as pd

def makeDictArray(**kwargs):
    data = {}
    size = kwargs.get('size',20)
    strings = 'What is this sentence about? And why do you care?'.split(' ')
    bools = [True,False]
    bytestrings = list(map(lambda x:bytes(x,encoding='utf-8'),strings))
    data['ByteString'] = np.random.choice(bytestrings, size)
    data['Unicode'] = np.random.choice(strings, size)
    data['Integer'] = np.random.randint(0,500,size)
    data['Float'] = np.random.random(size)
    data['Bool'] = np.random.choice(bools,size)
    data['Time'] = np.random.randint(0,1500,size)
    return data

def makeDF(**kwargs):
    size = kwargs.get('size',20)
    data = makeDictArray(size=size)
    return pd.DataFrame(data)

x = makeDF(size=1000000)
x['SUM'] = 0.
xx = x.groupby(['Time','Integer'])['Float'].agg('sum')
This is xx:
Time  Integer
0     0          0.826326
      1          4.897836
      2          5.238863
      3          6.694214
      4          6.791922
                   ...
1499  495        5.621809
      496        7.385356
      497        4.755907
      498        6.470006
      499        3.634070
Name: Float, Length: 749742, dtype: float64
What I tried:
uniqueTimes = pd.unique(x['Time'])
for t in uniqueTimes:
    for i in xx[t].index:
        idx = (x['Time'] == t) & (x['Integer'] == i)
        if idx.any():
            x.loc[idx,'SUM'] = xx[t][i]
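The groupby gives me the correct sums, but I want to put those values back into "x", in the newly created "SUM" column. I can do this with the double loop above, but it is slow and does not seem like the "pandas way".
Does anyone have any suggestions?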
If I understand the question correctly, you want a standard merge of x and xx on ["Time", "Integer"].
x = makeDF(size=5)
xx = x.groupby(['Time','Integer'])['Float'].agg('sum')
pd.merge(x, xx.to_frame(name="SUM"), on=["Time", "Integer"])
Output:
   ByteString  Unicode  Integer     Float   Bool  Time       SUM
0   b'about?'     this      209  0.116809  False  1418  0.116809
1      b'why'       is       12  0.043745   True  1159  0.043745
2    b'care?'    care?      493  0.479680  False   487  0.479680
3   b'about?'   about?      102  0.503759  False   335  0.503759
4      b'And'    care?      197  0.394406  False   207  0.394406
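As an alternative to the merge, not something the answer above proposes, a groupby followed by transform('sum') may be worth trying, since it writes the per-group totals straight back onto the original rows. The sketch below is only illustrative: it uses a small seeded DataFrame (an assumption, standing in for the output of makeDF) and checks the result against a lookup in xx.

```python
import numpy as np
import pandas as pd

# Small, seeded stand-in for the question's DataFrame (illustrative values only).
rng = np.random.default_rng(0)
x = pd.DataFrame({
    "Time": rng.integers(0, 3, size=12),
    "Integer": rng.integers(0, 2, size=12),
    "Float": rng.random(12),
})

# transform('sum') returns a Series aligned to x's index: each row receives
# the sum of 'Float' over its own (Time, Integer) group.
x["SUM"] = x.groupby(["Time", "Integer"])["Float"].transform("sum")

# Cross-check against the aggregated Series from the question.
xx = x.groupby(["Time", "Integer"])["Float"].agg("sum")
expected = [xx.loc[(t, i)] for t, i in zip(x["Time"], x["Integer"])]
print(np.allclose(x["SUM"], expected))
```

Because the result is assigned directly to x["SUM"], this also sidesteps a detail of the merge: if x already has a "SUM" column (as in the question, where it is initialised to 0.), pd.merge would produce suffixed "SUM_x" and "SUM_y" columns unless the placeholder column is dropped first.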