How to percentile variables by hour in R?
I have a task I need to execute in R. I've done it in Python (likely not in the most efficient way). The end goal: a dataframe with columns start_time, agent, percentile. There are ~8200 agents and the business is open from 7:00 through 23:00; each hour is denoted by an integer (7, 8, ..., 23).
I need to "re-percentile" these agents by hour.
start_time, agent, percentile
7, 1, 1
7, 2, 0.99
...
7, 8200, 0
...
23, 700, 1
23, 12, 0.99
Notice that every agent:hour combination will be represented with its normalized score. For reference, the normalization formula is
(x-min)/(max-min)
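As a small illustration of the formula above, here is a minimal Python sketch. The helper name `min_max` and the sample values are made up for this example, and the handling of the case where all values are equal (where the formula would divide by zero) is an assumption, not something the question specifies.

```python
def min_max(values):
    """Rescale numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: with identical values the formula divides by zero,
        # so this sketch maps every value to 0.0 instead.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


# The smallest input becomes 0.0 and the largest becomes 1.0.
scaled = min_max([0.05, 0.5, 1.0])
```

The same rescaling is applied twice in the task: once over all agents, then again within each hour.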
The data that I currently have looks like this.
Table A (metrics.csv)
idx, agent, percentile
1, z_agent[1], 1
2, z_agent[2], 0.05
3, z_agent[3], 0.5
...
8200, z_agent[8200], 0.99
Table B (hours.csv)
agent_idx, start_hour
1, 7
2, 7
3, 7
4, 7
Python code:
import operator

import pandas as pd

hours = pd.read_csv('hours.csv')
metrics = pd.read_csv('metrics.csv')

# Map agent label -> raw percentile, keeping only rows whose label contains 'agent'.
ag_rank = {row['agent']: row['percentile'] for idx, row in metrics.iterrows() if 'agent' in str(row['agent'])}

raw_scores = list(ag_rank.values())
raw_min = min(raw_scores)
raw_max = max(raw_scores)

def normed(x, mn, mx):
    # Min-max normalization: (x - min) / (max - min)
    return (x - mn) / (mx - mn)

# Normalize every agent's score across the whole population.
ag_rank = {k: normed(v, raw_min, raw_max) for k, v in ag_rank.items()}

tups = []
starts = sorted(hours['start_hour'].unique())
for hr in starts:
    agents = [f'z_agent[{a}]' for a in hours[hours['start_hour'] == hr]['agent_idx'].unique()]
    # Agents that are not both scheduled this hour and scored get a 0 below.
    missing = list(set(agents).symmetric_difference(ag_rank.keys()))
    # Filter agents and scores the same way so zip() keeps them aligned.
    present = [a for a in agents if a in ag_rank]
    scores = [ag_rank[a] for a in present]
    hi = max(scores)
    low = min(scores)
    # Re-normalize within the hour, then sort from highest to lowest.
    ranked = {a: normed(s, low, hi) for a, s in zip(present, scores)}
    ranked = sorted(ranked.items(), key=operator.itemgetter(1), reverse=True)
    for a, s in ranked:
        tups.append((hr, a, s))
    for m in missing:
        tups.append((hr, m, 0))
And the final table, in the form that I need it:
reperc = pd.DataFrame(data=tups,columns=['hour','agent','percentile'])
reperc.head()
>>>
7 z_agent[2853] 1.000000
7 z_agent[6004] 0.855892
7 z_agent[4366] 0.821758
7 z_agent[1742] 0.370188
7 z_agent[21] 0.000000
My questions are (A): How should I accomplish this effect in R? And (B, optional): Is there a more efficient way to accomplish this in Python? Perhaps a join would help.
Something like this should work. Happy to test/debug if you share reproducible data.
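library(dplyr)

metrics %>%
  # attach each agent's start hour(s); idx in metrics matches agent_idx in hours
  left_join(hours, by = c("idx" = "agent_idx")) %>%
  # the hour column in hours.csv is named start_hour
  group_by(start_hour) %>%
  mutate(
    new_percentile = (percentile - min(percentile)) / (max(percentile) - min(percentile))
  ) %>%
  arrange(start_hour, desc(new_percentile))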
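For part (B), the same join-then-group idea can be sketched in pandas. This is only a sketch under assumptions: the two small data frames below are made-up stand-ins for metrics.csv and hours.csv, the output column name `new_percentile` is borrowed from the dplyr pipeline above, and giving 0 to agents absent from an hour (and to hours where every score is equal) is a guess at the intended behavior based on the question's loop.

```python
import pandas as pd

# Made-up stand-ins for metrics.csv and hours.csv (not the real data).
metrics = pd.DataFrame({
    "idx": [1, 2, 3, 4],
    "agent": ["z_agent[1]", "z_agent[2]", "z_agent[3]", "z_agent[4]"],
    "percentile": [1.0, 0.05, 0.5, 0.99],
})
hours = pd.DataFrame({
    "agent_idx": [1, 2, 3, 3, 4],
    "start_hour": [7, 7, 7, 8, 8],
})

# Join each unique (agent, hour) pair to the agent's overall score.
merged = hours.drop_duplicates().merge(
    metrics, left_on="agent_idx", right_on="idx", how="inner"
)

# Min-max rescale within each hour: (x - min) / (max - min).
grouped = merged.groupby("start_hour")["percentile"]
lo = grouped.transform("min")
span = grouped.transform("max") - lo
# Assumption: an hour whose scores are all equal (span == 0) gets 0.0.
merged["new_percentile"] = (
    (merged["percentile"] - lo) / span.where(span != 0)
).fillna(0.0)

# Build every agent:hour combination; absent pairs are filled with 0.0.
full_index = pd.MultiIndex.from_product(
    [sorted(hours["start_hour"].unique()), metrics["agent"]],
    names=["start_hour", "agent"],
)
reperc = (
    merged[["start_hour", "agent", "new_percentile"]]
    .set_index(["start_hour", "agent"])
    .reindex(full_index, fill_value=0.0)
    .reset_index()
    .sort_values(["start_hour", "new_percentile"], ascending=[True, False])
    .reset_index(drop=True)
)
```

Using `groupby(...).transform` keeps the per-hour minimum and maximum aligned row by row with the merged frame, so the Python loops and dictionary bookkeeping are not needed.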