简体   繁体   English

如何找到百分位数然后在R中分组

[英]How to find percentile and then group in R

I have a data frame like below (df). 我有一个像下面的数据框(DF)。

day  area   hour  time  count
___  ____  _____  ___   ____
 1    1      0     1     10
 1    1      0     2     12
 1    1      0     3     8
 1    1      0     4     12    
 1    1      0     5     15  
 1    1      0     6     18 
 1    1      1     1     10
 1    1      1     2     12
 1    1      1     3     8
 1    1      1     4     12    
 1    1      1     5     15  
 1    1      1     6     18
 1    1      1     7     12    
 1    1      1     8     15  
 1    1      1     9     18
 1    1      2     1     10    
 1    1      2     2     18  
 1    1      2     3     19
 .....
 2    1      0     1     18
 2    1      0     2     12
 2    1      0     3     18
 2    1      0     4     12    
 2    1      1     1     8
 2    1      1     2     12
 2    1      1     3     18
 2    1      1     4     10    
 2    1      1     5     15  
 2    1      1     6     18
 2    1      1     7     12    
 2    1      1     8     15  
 2    1      1     9     18
 2    1      2     1     10    
 2    1      2     2     18  
 2    1      2     3     19
 2    1      2     4     9    
 2    1      2     5     18  
 2    1      2     6     9


..... 
 30    99      23     1     9    
 30    99      23     2     8  
 30    99      23     3     9
 30    99      23     4     19    
 30    99      23     5     18  
 30    99      23     6     9
 30    99      23     7     19    
 30    99      23     8     8  
 30    99      23     9     19

Here I have the data for 30 days for 87 areas (1 to 82 and then I have 90,93,95,97,99) and 24 hours (0 to 23) per day.So the data is about the time taken to cross the area and how many have crossed. 在这里,我有87个区域(1到82,然后有90,93,95,97,99)和每天24小时(0到23)的30天数据,所以该数据大约是穿越时间该地区以及有多少人穿越。

For example: 例如:

day  area   hour  time  count
___  ____  _____  ___   ____
 1    1      0     1     10
 1    1      0     2     12
 1    1      0     3     8
 1    1      0     4     12    
 1    1      0     5     15  
 1    1      0     6     18 

This gives me the On day 1 on hour 0 the time taken to cross the area 1 这使我在0小时的第1天穿越区域1所花费的时间

time  count   cumulative_count
___    ___    ________________
 1     10           10
 2     12           22
 3     8            30
 4     12           42    
 5     15           57
 6     18           75 
10 vehicles crossed the area in 1 minute.
12 vehicles crossed the area in 2 minutes.
8 vehicles crossed the area in 3 minutes.
12 vehicles crossed the area in 4 minutes.
15 vehicles crossed the area in 5 minutes.
18 vehicles crossed the area in 6 minutes.

From this I want to calculate How much time it took for 80% of the vehicles to cross area 1 in day 1 hour 0.So total vehicles=(10+12+8+12+15+18)=75.So 80% of 75 is 60.So time taken for 80% of the vehicles(80% of 75 which is 60) to pass the area 1 at day 1 hour 0 will be between 5 and 6(will be nearer to 5). 据此,我想计算出80%的车辆在1小时0内穿越区域1花费了多少时间,所以车辆总数=(10 + 12 + 8 + 12 + 15 + 18)= 75,因此80% 75中的60是60,因此80%的车辆(75中的80%为60)在第1天0时通过区域1所花费的时间将介于5到6之间(将接近5)。 So the result will be like: 因此结果将是这样的:

 day  area   hour    time_taken_for_80%vehicles_to_pass
    ___  ____   ____    ___________________________________
     1    1      0                5.33(approximately)
     1    1      1                7.30
     1    1      2                2.16
    ....
     30   1      23               3.13
     1    2      0                ---
     1    2      1                ---
     1    2      2                ---
     1    2      3                ---

 .......

     30    99     21              ---
     30    99     22              ---
     30    99     23              ---

   I know to I have to take quantile and then group by the area and day and hour.So I tried with 

library(dplyr)
grp <- group_by(df, day,area,hour,quantile(df$count,0.8))

But it does not work.Any help is appreciated 但这是行不通的,请多多帮助

My solution calculates the percentage of vehicles that crossed the area, for each time . 我的解决方案的计算,交叉领域,各车辆的百分比time Then gets the first time the percentage is above 80%: 然后得到第一time的百分比为80%以上:

str <- 'day  area   hour  time  count
1    1      0     1     10
1    1      0     2     12
1    1      0     3     8
1    1      0     4     12    
1    1      0     5     15  
1    1      0     6     18
1    1      1     1     10
1    1      1     2     12
1    1      1     3     8
1    1      1     4     12    
1    1      1     5     15  
1    1      1     6     18
1    1      1     7     12    
1    1      1     8     15  
1    1      1     9     18
1    1      2     1     10    
1    1      2     2     18  
1    1      2     3     19'



file <- textConnection(str)
df <- read.table(file, header = T)

df

library(dplyr)
df %>% group_by(day, area, hour) %>%
  mutate(cumcount = cumsum(count),
         p = cumcount/max(cumcount)) %>%
  filter(p > 0.8) %>%
  summarise(time = min(time))

result: 结果:

    day  area  hour  time
  <int> <int> <int> <int>
1     1     1     0     6
2     1     1     1     8
3     1     1     2     3

Or with a linear estimation of the time when 80% is reached: 或对达到80%的时间进行线性估计:

df %>% group_by(day, area, hour) %>%
  mutate(cumcount = cumsum(count),
         p = cumcount/max(cumcount),
         g = +(p > 0.8),
         order = (g*2-1)*time) %>%
  group_by(day, area, hour,g) %>%
  filter(row_number((g*2-1)*time)==1) %>%
  group_by(day, area, hour) %>%
  summarise(time = min(time)+(0.8-min(p))/(max(p)-min(p)))

result: 结果:

    day  area  hour     time
  <int> <int> <int>    <dbl>
1     1     1     0 5.166667
2     1     1     1 7.600000
3     1     1     2 2.505263

or get the same result using lag and lead 或使用laglead得到相同的结果

df %>% group_by(day, area, hour) %>%
  arrange(hour) %>%
  mutate(cumcount = cumsum(count),
         p = cumcount/max(cumcount)) %>%
  filter((p >= 0.8&lag(p)<0.8)|(p < 0.8&lead(p)>=0.8)) %>%
  summarise(time = min(time)+(0.8-min(p))/(max(p)-min(p)))

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM