How to percentile a variable by hour in R?

Question

I have a task I need to do in R. I've already done it in Python (likely not in the most efficient way). The end goal is a dataframe with columns start_time, agent, percentile. There are ~8200 agents and the business is open from 7:00 through 23:00, annotated as integers (7, 8, ..., 23). I need to "re-percentile" these agents by hour.

start_time, agent, percentile
7,          1,     1,
7,          2,     0.99,
...
7,          8200,  0,
...
23,         700,   1,
23,         12,    0.99

Notice that every agent:hour combination will be represented with its normalized score. For reference, the normalization formula is (x-min)/(max-min).
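As a quick illustration of that formula, here is a minimal Python sketch of min-max scaling. The function name `min_max_scale` and the decision to return 0.0 when all values are equal are assumptions made for this example; they are not part of the original post.

```python
def min_max_scale(values):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: when every value is identical the formula would divide
        # by zero, so this sketch returns 0.0 for each value instead.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


# The smallest input maps to 0 and the largest to 1.
scaled = min_max_scale([0.05, 0.5, 1.0])
```

Applying the same function separately to the scores of each hour is what "re-percentiling by hour" amounts to here.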

The data I currently have looks like this.

Table A (metrics.csv)

idx,  agent,          percentile
1,    z_agent[1],     1
2,    z_agent[2],     0.05
3,    z_agent[3],     0.5
...
8200, z_agent[8200],  0.99

Table B (hours.csv)

agent_idx,  start_hour
1           7
2           7
3           7
4           7

Python code:

import pandas as pd

hours = pd.read_csv('hours.csv')
metrics = pd.read_csv('metrics.csv')

# map agent name -> raw score, keeping only rows whose agent field looks like an agent name
ag_rank = {row['agent']:row['percentile'] for idx,row in metrics.iterrows() if 'agent' in row['agent']}
raw_scores = [s for s in ag_rank.values()]
raw_min = min(raw_scores)
raw_max = max(raw_scores)

def normed(x,mn,mx):
    return (x-mn)/(mx-mn)

norm_ag_scores = [normed(x,raw_min,raw_max) for x in raw_scores]

c = 0
for k,v in ag_rank.items():
    n = norm_ag_scores[c]
    ag_rank[k] = n
    c += 1

import operator
tups = []
starts = sorted([hr for hr in hours['start_hour'].unique()]) 
for hr in starts:
    agents = [f'z_agent[{a}]' for a in hours[hours['start_hour'] == hr]['agent_idx'].unique()]
    a_set = set(agents)
    b_set = set(ag_rank.keys())
    missing = list(a_set.symmetric_difference(b_set))
    present = [a for a in agents if a in ag_rank]  # agents that actually have a score
    scores = [ag_rank[a] for a in present]
    hi = max(scores)
    low = min(scores)
    sort = {a:normed(s,low,hi) for a,s in zip(present,scores)}
    sort = sorted(sort.items(),key=operator.itemgetter(1),reverse=True)
    for a,s in sort:
        tups.append((hr,a,s))
    for m in missing:
        tups.append((hr,m,0))

And the final table, in the form that I need it:

reperc = pd.DataFrame(data=tups,columns=['hour','agent','percentile'])
reperc.head()

>>>
7   z_agent[2853]   1.000000
7   z_agent[6004]   0.855892
7   z_agent[4366]   0.821758
7   z_agent[1742]   0.370188
7   z_agent[21]     0.000000

My questions are (A): How should I accomplish this effect in R? And (B, optional): Is there a better way to accomplish this in Python? Perhaps a join would help.

Answer 1

Something like this should work. Happy to test/debug if you share reproducible data.

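library(dplyr)

# Join each agent's score to its hours, then min-max rescale within each hour.
# hours.csv names the hour column start_hour.
metrics %>% 
  left_join(hours, by = c("idx" = "agent_idx")) %>%
  group_by(start_hour) %>%
  mutate(
    new_percentile = (percentile - min(percentile)) / (max(percentile) - min(percentile))
  ) %>%
  ungroup() %>%
  arrange(start_hour, desc(new_percentile))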
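For part (B) of the question, a rough pandas counterpart of the dplyr pipeline above might look like the following. This is only a sketch: the two small data frames are made-up stand-ins for metrics.csv and hours.csv, the output column name `new_percentile` is borrowed from the R code, and filling hours that have a single agent (or identical scores) with 0 is an assumption rather than something the original post specifies.

```python
import pandas as pd

# Tiny made-up stand-ins for metrics.csv and hours.csv (illustrative data only).
metrics = pd.DataFrame({
    "idx": [1, 2, 3, 4],
    "agent": ["z_agent[1]", "z_agent[2]", "z_agent[3]", "z_agent[4]"],
    "percentile": [1.0, 0.05, 0.5, 0.99],
})
hours = pd.DataFrame({
    "agent_idx": [1, 2, 3, 2, 4],
    "start_hour": [7, 7, 7, 8, 8],
})

# Join each agent's score onto every hour that agent works.
merged = hours.merge(metrics, left_on="agent_idx", right_on="idx", how="inner")

# Min-max rescale the scores within each hour.
grouped = merged.groupby("start_hour")["percentile"]
lo = grouped.transform("min")
hi = grouped.transform("max")
merged["new_percentile"] = (merged["percentile"] - lo) / (hi - lo)

# Assumption: an hour whose scores are all equal yields 0/0 (NaN); use 0 instead.
merged["new_percentile"] = merged["new_percentile"].fillna(0.0)

# Order by hour, best score first, as in the desired output table.
reperc = (
    merged[["start_hour", "agent", "new_percentile"]]
    .sort_values(["start_hour", "new_percentile"], ascending=[True, False])
    .reset_index(drop=True)
)
```

Unlike the loop in the question, this sketch only returns agent:hour pairs that appear in the hours table; it does not append rows with a score of 0 for agents who are absent from a given hour.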

Answer 1 was accepted on 2020-10-05 19:49:30.