How to percentile variables by hour in R?

I have a task I need to execute in R. I've done it in Python (likely not in the most efficient way). The end goal: a dataframe with columns start_time, agent, percentile. There are ~8200 agents and the business is open from 7:00 through 23:00; each hour is annotated by an integer (7, 8, ..., 23). I need to "re-percentile" these agents by hour.

start_time, agent, percentile
7,          1,     1
7,          2,     0.99
...
7,          8200,  0
...
23,         700,   1
23,         12,    0.99

Notice that every agent:hour combination will be represented with its normalized score. For reference, the normalization formula is (x-min)/(max-min).
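As a minimal sketch of that formula in Python, the helper below rescales a list of scores to the range 0 to 1. The name `min_max_normalize`, the sample scores, and the choice to return 0.0 for every element when all scores are equal are assumptions made for illustration; none of them come from the original post.

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: when every score is identical the formula divides by zero,
        # so this sketch returns 0.0 for each element instead.
        return [0.0 for _ in scores]
    return [(x - lo) / (hi - lo) for x in scores]


# Hypothetical scores for three agents that start in the same hour
normalized = min_max_normalize([0.25, 0.5, 1.0])
print(normalized)
```

The lowest score in the group maps to 0 and the highest to 1, which is why each hour in the target table has one agent at each extreme.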

The data that I currently have looks like this.

Table A (metrics.csv)

idx,  agent,          percentile
1,    z_agent[1],     1
2,    z_agent[2],     0.05
3,    z_agent[3],     0.5
...
8200, z_agent[8200],  0.99

Table B (hours.csv)

agent_idx,  start_hour
1           7
2           7
3           7
4           7

Python code:

import pandas as pd

hours = pd.read_csv('hours.csv')
metrics = pd.read_csv('metrics.csv')

# Build agent -> raw percentile, keeping only rows whose agent field contains 'agent'
ag_rank = {row['agent']:row['percentile'] for idx,row in metrics.iterrows() if 'agent' in str(row['agent'])}
raw_scores = list(ag_rank.values())
raw_min = min(raw_scores)
raw_max = max(raw_scores)

def normed(x,mn,mx):
    return (x-mn)/(mx-mn)

norm_ag_scores = [normed(x,raw_min,raw_max) for x in raw_scores]

c = 0
for k,v in ag_rank.items():
    n = norm_ag_scores[c]
    ag_rank[k] = n
    c += 1

import operator
tups = []
starts = sorted([hr for hr in hours['start_hour'].unique()]) 
for hr in starts:
    agents = [f'z_agent[{a}]' for a in hours[hours['start_hour'] == hr]['agent_idx'].unique()]
    a_set = set(agents)
    b_set = set(ag_rank.keys())
    missing = list(a_set.symmetric_difference(b_set))
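    # Keep each agent paired with its own score; agents absent from ag_rank are skipped together with their score
    scored = [(a, ag_rank[a]) for a in agents if a in ag_rank]
    scores = [s for a,s in scored]
    hi = max(scores)
    low = min(scores)
    sort = {a:normed(s,low,hi) for a,s in scored}
    sort = sorted(sort.items(),key=operator.itemgetter(1),reverse=True)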
    for a,s in sort:
        tups.append((hr,a,s))
    for m in missing:
        tups.append((hr,m,0))

And the final table, in the form that I need it:

reperc = pd.DataFrame(data=tups,columns=['hour','agent','percentile'])
reperc.head()

>>>
7   z_agent[2853]   1.000000
7   z_agent[6004]   0.855892
7   z_agent[4366]   0.821758
7   z_agent[1742]   0.370188
7   z_agent[21]     0.000000

My questions are (A): How should I accomplish this effect in R? And (B, optional): Is there a way to accomplish this effect in Python? Perhaps a join would help.

Something like this should work. Happy to test/debug if you share reproducible data.

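library(dplyr)

metrics %>%
  # attach each agent's start hour(s); hours.csv names this column start_hour
  left_join(hours, by = c("idx" = "agent_idx")) %>%
  group_by(start_hour) %>%
  mutate(
    # min-max normalize the score within each hour
    new_percentile = (percentile - min(percentile)) / (max(percentile) - min(percentile))
  ) %>%
  arrange(start_hour, desc(new_percentile))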
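
For the optional Python part of the question, a join followed by a grouped transform can probably replace the explicit loops. The sketch below is untested against the real files: the two small data frames are hypothetical stand-ins for metrics.csv and hours.csv, the `new_percentile` column name mirrors the R snippet, and an inner join is assumed (agents with no start hour are dropped rather than given a score of 0).

```python
import pandas as pd

# Hypothetical stand-ins for metrics.csv and hours.csv
metrics = pd.DataFrame({
    "idx": [1, 2, 3, 4],
    "agent": ["z_agent[1]", "z_agent[2]", "z_agent[3]", "z_agent[4]"],
    "percentile": [1.0, 0.05, 0.5, 0.8],
})
hours = pd.DataFrame({
    "agent_idx": [1, 2, 3, 3, 4],
    "start_hour": [7, 7, 7, 8, 8],
})

# Attach every start hour to the agent's score (one row per agent:hour pair)
merged = metrics.merge(hours, left_on="idx", right_on="agent_idx", how="inner")

# Min-max normalize the score within each hour
grouped = merged.groupby("start_hour")["percentile"]
lo = grouped.transform("min")
hi = grouped.transform("max")
merged["new_percentile"] = (merged["percentile"] - lo) / (hi - lo)

# Order by hour, then by descending normalized score
reperc = (
    merged[["start_hour", "agent", "new_percentile"]]
    .sort_values(["start_hour", "new_percentile"], ascending=[True, False])
    .reset_index(drop=True)
)
print(reperc)
```

One thing to watch: an hour with a single agent, or with identical scores for every agent, yields 0/0 and therefore NaN in `new_percentile`, so such hours may need separate handling.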
