Calculate the median by group and create a categorical variable in R

Here is an example of my data:

dput(mydat)

structure(list(ID.group = c(NA, 10150591L, NA, 10150591L, NA, 
10150591L, NA, 68837296L, NA, 68837296L, NA, 68837296L, NA, 124771228L, 
NA, 124771228L), UserID = c(NA, 181078814L, NA, 88578209L, NA, 
30240768L, NA, 334686951L, NA, 297170412L, NA, 265332359L, NA, 
216632504L, NA, 5272133L), countlike = c(NA, 44L, NA, 50L, NA, 
99L, NA, 1L, NA, 1L, NA, 15L, NA, 41L, NA, 20L), statistics.snt = structure(c(1L, 
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("", 
"fb"), class = "factor"), statistics.created_at = structure(c(1L, 
8L, 1L, 4L, 1L, 7L, 1L, 2L, 1L, 2L, 1L, 5L, 1L, 3L, 1L, 6L), .Label = c("", 
"10.04.2020 9:14", "11.04.2020 0:01", "11.04.2020 19:22", "12.04.2020 19:45", 
"12.04.2020 6:54", "13.04.2020 20:47", "17.04.2020 23:02"), class = "factor"), 
    statistics.updated_at = structure(c(1L, 8L, 1L, 7L, 1L, 6L, 
    1L, 3L, 1L, 3L, 1L, 4L, 1L, 5L, 1L, 2L), .Label = c("", "22.04.2020 12:27", 
    "22.04.2020 12:51", "22.04.2020 14:19", "22.04.2020 5:41", 
    "22.04.2020 6:18", "22.04.2020 7:37", "30.04.2020 16:55"), class = "factor"), 
    statistics.is_recount = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 
    1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("", "False"
    ), class = "factor")), class = "data.frame", row.names = c(NA, 
-16L))

I want to calculate the median of countlike by ID group:

library(psych)
describeBy(mydat,mydat$ID.group)

But I didn't get the result I need; I got all the descriptive statistics instead. How can I get a result like this?

ID group    median countlike
10150591    50
68837296    1

Then, how do I create a categorical variable for UserID? For example, the median for ID group 10150591 is 50.

If a user's countlike is more than 25% above the median of the group, the user is "red". 25% of 50 is 50/100*25 = 12.5, so the threshold is 50 + 12.5 = 62.5. UserID 30240768 has a countlike of 99, which is greater than 62.5, so he is "red".

If a user's countlike is more than 25% below the median of the group, the user is "green". Here the threshold is 50 - 12.5 = 37.5, and there is no such value in this group.

Finally, if the value is within ±24% of the group median, the user is "orange". 24% of 50 is 50/100*24 = 12, so a user with a countlike of 50 ± 12 (38-62) is "orange".

So the desired output is:

ID group    UserID  countlike   color
10150591    181078814   44  orange
10150591    88578209    50  orange
10150591    30240768    99  red
68837296    334686951   1   green
68837296    297170412   1   green
68837296    265332359   15  red

How can I implement these conditions?

Here is an answer using dplyr. We aggregate the data to medians, merge the medians with the original data, and then calculate color.

First, we read the dput() data from the OP and remove the rows with missing values.

data <- structure(list(
    ID.group = c(NA, 10150591L, NA, 10150591L, NA, 10150591L, NA, 68837296L,
                 NA, 68837296L, NA, 68837296L, NA, 124771228L, NA, 124771228L),
    UserID = c(NA, 181078814L, NA, 88578209L, NA, 30240768L, NA, 334686951L,
               NA, 297170412L, NA, 265332359L, NA, 216632504L, NA, 5272133L),
    countlike = c(NA, 44L, NA, 50L, NA, 99L, NA, 1L,
                  NA, 1L, NA, 15L, NA, 41L, NA, 20L),
    statistics.snt = structure(
        c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L),
        .Label = c("", "fb"), class = "factor"),
    statistics.created_at = structure(
        c(1L, 8L, 1L, 4L, 1L, 7L, 1L, 2L, 1L, 2L, 1L, 5L, 1L, 3L, 1L, 6L),
        .Label = c("", "10.04.2020 9:14", "11.04.2020 0:01", "11.04.2020 19:22",
                   "12.04.2020 19:45", "12.04.2020 6:54", "13.04.2020 20:47",
                   "17.04.2020 23:02"),
        class = "factor"),
    statistics.updated_at = structure(
        c(1L, 8L, 1L, 7L, 1L, 6L, 1L, 3L, 1L, 3L, 1L, 4L, 1L, 5L, 1L, 2L),
        .Label = c("", "22.04.2020 12:27", "22.04.2020 12:51", "22.04.2020 14:19",
                   "22.04.2020 5:41", "22.04.2020 6:18", "22.04.2020 7:37",
                   "30.04.2020 16:55"),
        class = "factor"),
    statistics.is_recount = structure(
        c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L),
        .Label = c("", "False"), class = "factor")),
    class = "data.frame", row.names = c(NA, -16L))

data <- data[!is.na(data$ID.group),]
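For the first part of the question (only the median per group), a base R sketch with aggregate() may be enough. The block below is an illustration rather than part of the original answer: `dat` and `medians` are names chosen for this example, and `dat` is a hand-trimmed copy of the OP's data that keeps only the three relevant columns, with the all-NA rows already dropped.

```r
# Assumption: trimmed copy of the OP's data (three columns, NA rows removed).
dat <- data.frame(
  ID.group  = c(10150591L, 10150591L, 10150591L,
                68837296L, 68837296L, 68837296L,
                124771228L, 124771228L),
  UserID    = c(181078814L, 88578209L, 30240768L,
                334686951L, 297170412L, 265332359L,
                216632504L, 5272133L),
  countlike = c(44L, 50L, 99L, 1L, 1L, 15L, 41L, 20L)
)

# One row per ID.group, holding the median of countlike for that group.
medians <- aggregate(countlike ~ ID.group, data = dat, FUN = median)

# Rename the aggregated column so it is not confused with the raw values.
names(medians)[2] <- "mdn_countlike"

print(medians)
```

This returns one row per group, which is the shape the question asks for; the dplyr pipeline that follows computes the same medians as its first step.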

Next, we load dplyr and calculate the desired output.

library(dplyr)

# Compute the median per group, join it back onto the original rows,
# then classify each row relative to its group median.
mergedData <- data %>%
     group_by(ID.group) %>%
     summarise(mdn_countlike = median(countlike)) %>%
     inner_join(data, by = "ID.group") %>%
     mutate(color = case_when(countlike > 1.25 * mdn_countlike ~ "red",
                              countlike < 0.75 * mdn_countlike ~ "green",
                              countlike >= 0.75 * mdn_countlike &
                                   countlike <= 1.25 * mdn_countlike ~ "orange"))

mergedData[,c("ID.group","UserID","countlike","mdn_countlike","color")]

...and the output:

> mergedData[,c("ID.group","UserID","countlike","mdn_countlike","color")]
# A tibble: 8 x 5
   ID.group    UserID countlike mdn_countlike color 
      <int>     <int>     <int>         <dbl> <chr> 
1  10150591 181078814        44          50   orange
2  10150591  88578209        50          50   orange
3  10150591  30240768        99          50   red   
4  68837296 334686951         1           1   orange
5  68837296 297170412         1           1   orange
6  68837296 265332359        15           1   red   
7 124771228 216632504        41          30.5 red   
8 124771228   5272133        20          30.5 green 
>
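If dplyr is not available, the same classification can probably be reproduced in base R. The following is a minimal sketch, not part of the original answer: it assumes the same 25% thresholds as the pipeline above, uses ave() to repeat each group's median on every row (so no join is needed), and works on `dat`, a hand-trimmed copy of the OP's data introduced only for this example.

```r
# Assumption: trimmed copy of the OP's data (three columns, NA rows removed).
dat <- data.frame(
  ID.group  = c(10150591L, 10150591L, 10150591L,
                68837296L, 68837296L, 68837296L,
                124771228L, 124771228L),
  UserID    = c(181078814L, 88578209L, 30240768L,
                334686951L, 297170412L, 265332359L,
                216632504L, 5272133L),
  countlike = c(44L, 50L, 99L, 1L, 1L, 15L, 41L, 20L)
)

# ave() applies median() within each ID.group and returns a vector
# as long as the input, so every row carries its own group median.
# as.numeric() avoids mixing integer and double medians across groups.
dat$mdn_countlike <- ave(as.numeric(dat$countlike), dat$ID.group, FUN = median)

# More than 25% above the median is "red", more than 25% below is "green",
# everything in between (boundaries included) is "orange".
dat$color <- ifelse(dat$countlike > 1.25 * dat$mdn_countlike, "red",
             ifelse(dat$countlike < 0.75 * dat$mdn_countlike, "green",
                    "orange"))

print(dat)
```

Note that for group 68837296 the median is 1, so the two users with a countlike of 1 sit exactly on the median and come out as "orange" under the stated rule, not "green" as in the question's desired output.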
