How to accumulate unique sum of columns across pandas index
I have a pandas DataFrame, df, which I created with:
df = pd.read_table('sorted_df_changes.txt', index_col=0, parse_dates=True, names=['date', 'rev_id', 'score'])
which is structured like so:
                     page_id      score
date
2001-05-23 19:50:14     2430   7.632989
2001-05-25 11:53:55  1814033  18.946234
2001-05-27 17:36:37     2115   3.398154
2001-08-04 21:00:51      311  19.386016
2001-08-04 21:07:42      314  14.886722
date is the index and is of type DatetimeIndex.
Every page_id may appear on one or more dates (it is not unique), and there are roughly 1 million rows. All of the pages together make up the document.
I need to get a score for the entire document at every time in date, counting only the latest score for any given page_id. For example, given:
page_id score
date
2001-05-23 19:50:14 1 3
2001-05-25 11:53:55 2 4
2001-05-27 17:36:37 1 5
2001-05-28 19:36:37 1 1
the expected output is:

score
date
2001-05-23 19:50:14 3
2001-05-25 11:53:55 7 (3 + 4)
2001-05-27 17:36:37 9 (5 + 4)
2001-05-28 19:36:37 5 (1 + 4)
Page 2's score keeps being counted since it never reappears, but each time page 1 reappears its new score replaces the old one.
Edit:
Finally, I found a solution that doesn't need a for loop:
df.score.groupby(df.page_id).transform(lambda s:s.diff().combine_first(s)).cumsum()
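To see why this works: within each page's group, the first score enters as-is (combine_first fills the NaN that diff leaves at the first position) and every later score enters as the delta from that page's previous score, so the running cumsum always reflects each page's latest value. A minimal sketch on the question's toy data (the page ids and timestamps are the example ones, not real data):

```python
import pandas as pd

df = pd.DataFrame(
    {"page_id": [1, 2, 1, 1, 3, 3], "score": [3, 4, 5, 1, 6, 9]},
    index=pd.to_datetime([
        "2001-05-23 19:50:14", "2001-05-25 11:53:55",
        "2001-05-27 17:36:37", "2001-05-28 19:36:37",
        "2001-05-28 19:36:38", "2001-05-28 19:36:39",
    ]),
)

# Per page: first occurrence contributes its score, later ones the delta.
per_event = df.score.groupby(df.page_id).transform(lambda s: s.diff().combine_first(s))
print(per_event.cumsum().tolist())  # [3.0, 7.0, 9.0, 5.0, 11.0, 14.0]
```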
I think a for loop is needed:
from io import StringIO

import numpy as np
import pandas as pd

txt = """date,page_id,score
2001-05-23 19:50:14, 1,3
2001-05-25 11:53:55, 2,4
2001-05-27 17:36:37, 1,5
2001-05-28 19:36:37, 1,1
2001-05-28 19:36:38, 3,6
2001-05-28 19:36:39, 3,9
"""
df = pd.read_csv(StringIO(txt), index_col=0)

def score_sum_py(page_id, scores):
    # Running document total: subtract each page's previous score,
    # add its new one.
    score_sum = 0
    last_score = [0] * (np.max(page_id) + 1)
    result = np.empty_like(scores.values)
    for i, (pid, score) in enumerate(zip(page_id, scores)):
        score_sum = score_sum - last_score[pid] + score
        last_score[pid] = score
        result[i] = score_sum
    return pd.Series(result, index=scores.index, name="score_sum")

print(score_sum_py(pd.factorize(df.page_id)[0], df.score))
output:
date
2001-05-23 19:50:14 3
2001-05-25 11:53:55 7
2001-05-27 17:36:37 9
2001-05-28 19:36:37 5
2001-05-28 19:36:38 11
2001-05-28 19:36:39 14
Name: score_sum
If the loop in Python is slow, you can try converting the two series page_id and scores to Python lists first; looping over the lists and doing the arithmetic with Python's native integers may be faster.
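A sketch of that pure-list variant (operating on plain lists and Python ints avoids per-element numpy overhead; the input values here are just the example data):

```python
def score_sum_list(page_id, scores):
    # Same running total as above, but over plain Python lists.
    score_sum = 0
    last_score = [0] * (max(page_id) + 1)
    result = []
    for pid, score in zip(page_id, scores):
        score_sum += score - last_score[pid]
        last_score[pid] = score
        result.append(score_sum)
    return result

print(score_sum_list([0, 1, 0, 0, 2, 2], [3, 4, 5, 1, 6, 9]))  # [3, 7, 9, 5, 11, 14]
```

To feed real data in, call `Series.tolist()` on both columns first.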
If speed is important, you can also try Cython:
%%cython
cimport cython
cimport numpy as np
import numpy as np

@cython.wraparound(False)
@cython.boundscheck(False)
def score_sum(np.ndarray[int] page_id, np.ndarray[long long] scores):
    cdef int i
    cdef long long score_sum, pid, score
    cdef np.ndarray[long long] last_score, result
    score_sum = 0
    last_score = np.zeros(np.max(page_id) + 1, dtype=np.int64)
    result = np.empty_like(scores)
    for i in range(len(page_id)):
        pid = page_id[i]
        score = scores[i]
        score_sum = score_sum - last_score[pid] + score
        last_score[pid] = score
        result[i] = score_sum
    return result
Here I use pandas.factorize() to convert page_id to an array of integers in the range 0 to N, where N is the number of unique values in page_id. You can also use a dict to cache the last_score of every page_id, without using pandas.factorize().
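A minimal sketch of that dict-based variant (the helper name score_sum_dict is mine, and the data is the question's toy example):

```python
import pandas as pd

def score_sum_dict(page_id, scores):
    # Track each page's latest score in a dict keyed by the raw page_id,
    # so no factorization to 0..N-1 is needed.
    last = {}
    total = 0.0
    out = []
    for pid, score in zip(page_id, scores):
        total += score - last.get(pid, 0)
        last[pid] = score
        out.append(total)
    return pd.Series(out, index=scores.index, name="score_sum")

df = pd.DataFrame({"page_id": [1, 2, 1, 1], "score": [3, 4, 5, 1]})
print(score_sum_dict(df.page_id, df.score).tolist())  # [3.0, 7.0, 9.0, 5.0]
```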
An alternative data structure makes this calculation easier to reason about. Performance won't be as good as the other answers, but I thought it worth mentioning (mainly because it uses my favourite pandas function...):
In [11]: scores = pd.get_dummies(df['page_id']).mul(df['score'], axis=0).pipe(lambda x: x.where(x != 0))
In [12]: scores
Out[12]:
1 2 3
date
2001-05-23 19:50:14 3 NaN NaN
2001-05-25 11:53:55 NaN 4 NaN
2001-05-27 17:36:37 5 NaN NaN
2001-05-28 19:36:37 1 NaN NaN
2001-05-28 19:36:38 NaN NaN 6
2001-05-28 19:36:39 NaN NaN 9
In [13]: scores.ffill()
Out[13]:
1 2 3
date
2001-05-23 19:50:14 3 NaN NaN
2001-05-25 11:53:55 3 4 NaN
2001-05-27 17:36:37 5 4 NaN
2001-05-28 19:36:37 1 4 NaN
2001-05-28 19:36:38 1 4 6
2001-05-28 19:36:39 1 4 9
In [14]: scores.ffill().sum(axis=1)
Out[14]:
date
2001-05-23 19:50:14 3
2001-05-25 11:53:55 7
2001-05-27 17:36:37 9
2001-05-28 19:36:37 5
2001-05-28 19:36:38 11
2001-05-28 19:36:39 14
Is this what you want? But I think it's a clumsy solution.
In [164]: df['result'] = [df[:i+1].groupby('page_id').last().sum()[0] for i in range(len(df))]
In [165]: df
Out[165]:
page_id score result
date
2001-05-23 19:50:14 1 3 3
2001-05-25 11:53:55 2 4 7
2001-05-27 17:36:37 1 5 9
2001-05-28 19:36:37 1 1 5
Here is an interim solution I put together using the standard library. I would like to see an elegant, efficient solution using pandas.
import csv
from collections import defaultdict

page_scores = defaultdict(float)
date_scores = []  # [(date, score)]

def get_and_update_score_diff(page_id, new_score):
    diff = new_score - page_scores[page_id]
    page_scores[page_id] = new_score
    return diff

# Note: there are some duplicate dates and the file is sorted by date.
# Format: 2001-05-23T19:50:14Z, 2430, 7.632989
with open('sorted_df_changes.txt') as f:
    reader = csv.reader(f, delimiter='\t')

    # Seed the running total with the first row.
    date_string, page_id, score = next(reader)
    score = float(score)
    page_scores[page_id] = score
    date_scores.append((date_string, score))

    for date_string, page_id, score in reader:
        score = float(score)
        score_diff = get_and_update_score_diff(page_id, score)
        if date_scores[-1][0] == date_string:
            date_scores[-1] = (date_string, date_scores[-1][1] + score_diff)
        else:
            date_scores.append((date_string, date_scores[-1][1] + score_diff))