Create a categorical variable in R and calculate the median by group

Question

Here is an example of my data:

dput(mydat)

structure(list(ID.group = c(NA, 10150591L, NA, 10150591L, NA, 
10150591L, NA, 68837296L, NA, 68837296L, NA, 68837296L, NA, 124771228L, 
NA, 124771228L), UserID = c(NA, 181078814L, NA, 88578209L, NA, 
30240768L, NA, 334686951L, NA, 297170412L, NA, 265332359L, NA, 
216632504L, NA, 5272133L), countlike = c(NA, 44L, NA, 50L, NA, 
99L, NA, 1L, NA, 1L, NA, 15L, NA, 41L, NA, 20L), statistics.snt = structure(c(1L, 
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("", 
"fb"), class = "factor"), statistics.created_at = structure(c(1L, 
8L, 1L, 4L, 1L, 7L, 1L, 2L, 1L, 2L, 1L, 5L, 1L, 3L, 1L, 6L), .Label = c("", 
"10.04.2020 9:14", "11.04.2020 0:01", "11.04.2020 19:22", "12.04.2020 19:45", 
"12.04.2020 6:54", "13.04.2020 20:47", "17.04.2020 23:02"), class = "factor"), 
    statistics.updated_at = structure(c(1L, 8L, 1L, 7L, 1L, 6L, 
    1L, 3L, 1L, 3L, 1L, 4L, 1L, 5L, 1L, 2L), .Label = c("", "22.04.2020 12:27", 
    "22.04.2020 12:51", "22.04.2020 14:19", "22.04.2020 5:41", 
    "22.04.2020 6:18", "22.04.2020 7:37", "30.04.2020 16:55"), class = "factor"), 
    statistics.is_recount = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 
    1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("", "False"
    ), class = "factor")), class = "data.frame", row.names = c(NA, 
-16L))

I want to calculate the median of countlike by ID group:

library(psych)
describeBy(mydat,mydat$ID.group)

But I didn't get the result I need; I get all the descriptive statistics. How can I get a result like this?

ID group    median countlike
10150591    50
68837296    1

Then, how can I calculate a categorical variable for UserID? For example, the median for ID group 10150591 is 50.

If a user's countlike is more than 25% above the median of the group, the user is "red". 25% of 50 is 50/100*25 = 12.5, so the threshold is 50 + 12.5 = 62.5. UserID 30240768 has countlike 99, which is more than 62.5, so he is "red".

If a user's countlike is more than 25% below the median of the group, the user is "green". 50 - 12.5 = 37.5, and there is no such value in this group.

Finally, if the value is within ±24% of the group median, the user is "orange". 24% of 50 is 50/100*24 = 12, so a user with countlike in 50 ± 12 (38-62) is "orange".

So the desired output is:

ID group    UserID  countlike   median countlike
10150591    181078814   44  orange
10150591    88578209    50  orange
10150591    30240768    99  red
68837296    334686951   1   green
68837296    297170412   1   green
68837296    265332359   15  red

How can I implement these conditions?

Answer 1

Here is an answer using dplyr. We aggregate the data to medians, merge the medians with the original data, and then calculate color.

First, we read the dput() data from the OP and remove the rows with missing values.

data <- structure(list(ID.group = c(NA, 10150591L, NA, 10150591L, NA, 
10150591L, NA, 68837296L, NA, 68837296L, NA, 68837296L, NA, 124771228L, 
NA, 124771228L), UserID = c(NA, 181078814L, NA, 88578209L, NA, 
30240768L, NA, 334686951L, NA, 297170412L, NA, 265332359L, NA, 
216632504L, NA, 5272133L), countlike = c(NA, 44L, NA, 50L, NA, 
99L, NA, 1L, NA, 1L, NA, 15L, NA, 41L, NA, 20L), statistics.snt = structure(c(1L, 
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("", 
"fb"), class = "factor"), statistics.created_at = structure(c(1L, 
8L, 1L, 4L, 1L, 7L, 1L, 2L, 1L, 2L, 1L, 5L, 1L, 3L, 1L, 6L), .Label = c("", 
"10.04.2020 9:14", "11.04.2020 0:01", "11.04.2020 19:22", "12.04.2020 19:45", 
"12.04.2020 6:54", "13.04.2020 20:47", "17.04.2020 23:02"), class = "factor"), 
    statistics.updated_at = structure(c(1L, 8L, 1L, 7L, 1L, 6L, 
    1L, 3L, 1L, 3L, 1L, 4L, 1L, 5L, 1L, 2L), .Label = c("", "22.04.2020 12:27", 
    "22.04.2020 12:51", "22.04.2020 14:19", "22.04.2020 5:41", 
    "22.04.2020 6:18", "22.04.2020 7:37", "30.04.2020 16:55"), class = "factor"), 
    statistics.is_recount = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 
    1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("", "False"
    ), class = "factor")), class = "data.frame", row.names = c(NA, 
-16L))

data <- data[!is.na(data$ID.group),]
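The first part of the question (only the median per group) can be checked on its own before doing anything else. Below is a minimal sketch using base R's aggregate(). To keep it self-contained it works on `likes`, a hypothetical compact copy of the three relevant columns from the question (not a name used in the original post), with the NA rows already dropped.

```r
# Hypothetical compact copy of the question's data: only the three
# columns needed here, with the all-NA rows already removed.
likes <- data.frame(
  ID.group  = c(10150591L, 10150591L, 10150591L,
                68837296L, 68837296L, 68837296L,
                124771228L, 124771228L),
  UserID    = c(181078814L, 88578209L, 30240768L,
                334686951L, 297170412L, 265332359L,
                216632504L, 5272133L),
  countlike = c(44L, 50L, 99L, 1L, 1L, 15L, 41L, 20L)
)

# One row per ID.group holding the median of countlike in that group
medians <- aggregate(countlike ~ ID.group, data = likes, FUN = median)
names(medians)[2] <- "mdn_countlike"
medians
```

This should give one median per ID group (50, 1 and 30.5 for the sample data), which is the table the question asks for first.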

Next, we load dplyr and calculate the desired output.

library(dplyr)

# Median of countlike per group, joined back onto the original rows
mergedData <- data %>%
     group_by(ID.group) %>%
     summarise(mdn_countlike = median(countlike)) %>%
     inner_join(data, by = "ID.group") %>%
     # red: more than 25% above the median; green: more than 25% below
     mutate(color = case_when(countlike > 1.25 * mdn_countlike ~ "red",
                              countlike < 0.75 * mdn_countlike ~ "green",
                              countlike >= 0.75 * mdn_countlike & 
                                   countlike <= 1.25 * mdn_countlike ~ "orange"))

mergedData[,c("ID.group","UserID","countlike","mdn_countlike","color")]

...and the output:

> mergedData[,c("ID.group","UserID","countlike","mdn_countlike","color")]
# A tibble: 8 x 5
   ID.group    UserID countlike mdn_countlike color 
      <int>     <int>     <int>         <dbl> <chr> 
1  10150591 181078814        44          50   orange
2  10150591  88578209        50          50   orange
3  10150591  30240768        99          50   red   
4  68837296 334686951         1           1   orange
5  68837296 297170412         1           1   orange
6  68837296 265332359        15           1   red   
7 124771228 216632504        41          30.5 red   
8 124771228   5272133        20          30.5 green 
>
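The same classification can also be written without dplyr and without a join. The following is only a sketch of a base R alternative, not the method used above: ave() repeats each group's median on every row of that group, and `classify_like()` is a hypothetical helper (with an assumed `tol` parameter for the 25% band) introduced here for illustration. `likes` is again a hypothetical compact copy of the question's data.

```r
# Hypothetical compact copy of the question's data (NA rows removed)
likes <- data.frame(
  ID.group  = c(10150591L, 10150591L, 10150591L,
                68837296L, 68837296L, 68837296L,
                124771228L, 124771228L),
  UserID    = c(181078814L, 88578209L, 30240768L,
                334686951L, 297170412L, 265332359L,
                216632504L, 5272133L),
  countlike = c(44L, 50L, 99L, 1L, 1L, 15L, 41L, 20L)
)

# Group median repeated on every row of its group
likes$mdn_countlike <- ave(as.numeric(likes$countlike), likes$ID.group,
                           FUN = median)

# Illustrative helper: "red" above (1 + tol) * median,
# "green" below (1 - tol) * median, "orange" otherwise
classify_like <- function(x, med, tol = 0.25) {
  ifelse(x > (1 + tol) * med, "red",
         ifelse(x < (1 - tol) * med, "green", "orange"))
}

likes$color <- classify_like(likes$countlike, likes$mdn_countlike)
likes
```

With these thresholds the two users whose countlike is 1 are classified "orange", as in the output above, because their value equals the group median of 1. This differs from the "green" shown in the question's desired output, which does not follow from the stated rule.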

2 Accepted 2020-05-21 10:47:11
