
R Show empty groups with cut()

I have a set of data:

   Abweichung BW_Gesamt
76        236   1137747
77       2000   1149019
78       2000   1227972
79       2331   1346480
80       4000   2226810
81       5272   2874114
82       8585   4418070
83      15307   5389585

Now I want to group them. The difficulty is that the breaks have to be flexible: I enter the MIN/MAX of the x-axis and the number of groups, and the data is then cut into groups that are "MYSCHRTW" wide:

bins <- 4 # Number of groups
MYMIN <- 0
MYMAX <- 20000
MYSCHRTW <- (-MYMIN+MYMAX)%/%bins # Width of one group: 5000
GRENZEN <- seq(from = MYMIN, by = MYSCHRTW, length.out = bins)
GRENZEN <- c(GRENZEN, MYMAX+1) # Breaks: 0 5000 10000 15000 20001
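Since the breaks depend only on the minimum, the maximum, and the number of groups, the computation above can be wrapped in a small helper. This is only a sketch; the function name make_breaks and its argument names are made up for illustration and do not appear in the original post.

```r
# Hypothetical helper: builds the same break vector as the lines above.
make_breaks <- function(lo, hi, bins) {
  width <- (hi - lo) %/% bins                      # width of one group
  lower <- seq(from = lo, by = width, length.out = bins)
  c(lower, hi + 1)                                 # last break is hi + 1, as in the post
}

GRENZEN <- make_breaks(0, 20000, 4)
print(GRENZEN)
```

Changing bins or the limits then only requires a different call, for example make_breaks(0, 20000, 8).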

I use the cut function:

setDT(mydata)[ , Gruppen := cut(mydata$Abweichung,breaks=GRENZEN,dig.lab = 5)]

The problem is that one group is missing: it is empty, so it is not displayed. Plotting the data without that group can bias the result. So how can I add the group (10000,15000], with Abweichung and BW_Gesamt set to 0?

   Abweichung BW_Gesamt       Gruppen
1:        236   1137747      (0,5000]
2:       2000   1149019      (0,5000]
3:       2000   1227972      (0,5000]
4:       2331   1346480      (0,5000]
5:       4000   2226810      (0,5000]
6:       5272   2874114  (5000,10000]
7:       8585   4418070  (5000,10000]
8:      15307   5389585 (15000,20001]
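A side note on the table above: the row for (10000,15000] is absent because no observation falls into it, but cut() returns a factor, and the empty interval is still one of its levels. A quick base R check, using the data and breaks from the question (the variable name counts is only illustrative):

```r
Abweichung <- c(236, 2000, 2000, 2331, 4000, 5272, 8585, 15307)
GRENZEN <- c(0, 5000, 10000, 15000, 20001)

Gruppen <- cut(Abweichung, breaks = GRENZEN, dig.lab = 5)

# All four intervals are levels of the factor, including the empty one
print(levels(Gruppen))

# table() counts per level and reports 0 for levels without observations
counts <- table(Gruppen)
print(counts)
```

So the information about the empty group is not lost; it only has to be carried over into whatever summary is plotted.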

OK, I don't know if it's efficient, but there is a way:

library(data.table)

The data you work on:

mydata <- data.table(Abweichung = c(236,2000,2000,2331,4000,5272,8585,15307),
                     BW_Gesamt = c(1137747,1149019,1227972,1346480,2226810,2874114,4418070,5389585))


> mydata
   Abweichung BW_Gesamt
1:        236   1137747
2:       2000   1149019
3:       2000   1227972
4:       2331   1346480
5:       4000   2226810
6:       5272   2874114
7:       8585   4418070
8:      15307   5389585

First create a data.table that contains all the groups from cut():

groups_cut <- data.table(Gruppen = levels(cut(mydata[, Abweichung],breaks=GRENZEN,dig.lab = 5)))

> groups_cut
         Gruppen
1:      (0,5000]
2:  (5000,10000]
3: (10000,15000]
4: (15000,20001]

Then a second data.table in which you count the number of occurrences by the variable Gruppen:

mydata <- mydata[ , Gruppen := cut(mydata[, Abweichung],breaks=GRENZEN,dig.lab = 5)][, .N, by = Gruppen]

         Gruppen N
1:      (0,5000] 5
2:  (5000,10000] 2
3: (15000,20001] 1

Now you can merge the two data.tables:

merge_dt <- mydata[groups_cut, on = "Gruppen"]

> merge_dt
         Gruppen  N
1:      (0,5000]  5
2:  (5000,10000]  2
3: (10000,15000] NA
4: (15000,20001]  1

If you don't want to keep the NA value, you can replace it with 0 right after the merge:

merge_dt <- mydata[groups_cut, on = "Gruppen"][, N := replace(N, is.na(N), 0)]

> merge_dt
         Gruppen N
1:      (0,5000] 5
2:  (5000,10000] 2
3: (10000,15000] 0
4: (15000,20001] 1
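Putting the steps of this answer together, here is a minimal self-contained sketch that also sums BW_Gesamt per group, since the question asked for both columns to be 0 in the empty group. The names all_groups, summary_dt, and result are illustrative and not from the original post.

```r
library(data.table)

mydata <- data.table(
  Abweichung = c(236, 2000, 2000, 2331, 4000, 5272, 8585, 15307),
  BW_Gesamt  = c(1137747, 1149019, 1227972, 1346480, 2226810, 2874114, 4418070, 5389585)
)
GRENZEN <- c(0, 5000, 10000, 15000, 20001)

# Assign each row to its interval
mydata[, Gruppen := cut(Abweichung, breaks = GRENZEN, dig.lab = 5)]

# One row per level, including levels without observations
lv <- levels(mydata$Gruppen)
all_groups <- data.table(Gruppen = factor(lv, levels = lv))

# Count and sum for the groups that actually occur
summary_dt <- mydata[, .(N = .N, BW_Gesamt = sum(BW_Gesamt)), by = Gruppen]

# Join onto the full list of groups; the empty group gets NA in both columns
result <- summary_dt[all_groups, on = "Gruppen"]

# Replace the NAs of the empty group with zeros
result[is.na(N), `:=`(N = 0L, BW_Gesamt = 0)]
print(result)
```

The result has one row per interval in level order, which is the shape a bar plot over all groups needs.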

I guess I found an answer myself. Continuing from my initial post at:

setDT(mydata)[ , Gruppen := cut(mydata$Abweichung,breaks=GRENZEN,dig.lab = 5)]
> print(mydata)
   Abweichung BW_Gesamt       Gruppen
1:        236   1137747      (0,5000]
2:       2000   1149019      (0,5000]
3:       2000   1227972      (0,5000]
4:       2331   1346480      (0,5000]
5:       4000   2226810      (0,5000]
6:       5272   2874114  (5000,10000]
7:       8585   4418070  (5000,10000]
8:      15307   5389585 (15000,20000]

> class(mydata$Abweichung)
[1] "numeric"
> class(mydata$BW_Gesamt)
[1] "numeric"

library(dplyr)

mydata <- levels(mydata$Gruppen) %>%               # all levels of Gruppen, including empty ones
  data.frame(Gruppen = .) %>%                      # one row per level
  left_join(mydata %>%                             # join with the per-group summary
              group_by(Gruppen) %>%                # for each group that actually occurs
              summarise(Abweichung = n(),          # number of rows in the group
                        BW_Gesamt = sum(BW_Gesamt)),  # sum of BW_Gesamt in the group
            by = "Gruppen") %>%
  mutate(Abweichung = coalesce(Abweichung, 0L)) %>%        # replace NAs with 0s
  mutate(BW_Gesamt = coalesce(as.integer(BW_Gesamt), 0L))  # convert to integer, then replace NAs with 0s

> class(mydata$Abweichung)
[1] "integer"
> class(mydata$BW_Gesamt)
[1] "integer"

> print(mydata)
        Gruppen Abweichung BW_Gesamt
1      (0,5000]          5   7088028
2  (5000,10000]          2   7292184
3 (10000,15000]          0         0
4 (15000,20000]          1   5389585

The two mutate() calls differ because I found that Abweichung, which now holds the n() count, becomes an integer, while the summed BW_Gesamt remains numeric, so I convert it with as.integer() before filling in 0L.
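One caveat about converting the sums to integer, shown as a base R sketch: R integers are 32-bit, so a sum above .Machine$integer.max turns into NA on conversion. The totals here are small enough, but with more data it may be safer to keep the column numeric and fill with a numeric 0. The values in sums are taken from the output above; the variable names and the 3e9 example value are illustrative.

```r
# Largest value an R integer can hold
print(.Machine$integer.max)

# A value above that limit becomes NA when converted (with a warning)
too_big <- suppressWarnings(as.integer(3e9))
print(too_big)

# Keeping the column numeric avoids the conversion entirely
sums <- c(7088028, 7292184, NA, 5389585)
filled <- replace(sums, is.na(sums), 0)
print(filled)
```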

I don't know how efficient this method is. I found it here: LINK. Thanks to AntoniosK.

Maybe someone has an idea how it could be optimized. In my opinion it has the advantage that the result computed for each group can be changed freely, so I can show the sum of BW_Gesamt and the number of occurrences of Abweichung at the same time.
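One possible simplification, assuming dplyr 0.8.0 or later: group_by() accepts .drop = FALSE, which keeps empty factor levels as groups, so the separate join and the NA replacement are not needed. This is a sketch on the data from the question, not a benchmarked alternative; the name result is illustrative.

```r
library(dplyr)

mydata <- data.frame(
  Abweichung = c(236, 2000, 2000, 2331, 4000, 5272, 8585, 15307),
  BW_Gesamt  = c(1137747, 1149019, 1227972, 1346480, 2226810, 2874114, 4418070, 5389585)
)
GRENZEN <- c(0, 5000, 10000, 15000, 20001)

result <- mydata %>%
  mutate(Gruppen = cut(Abweichung, breaks = GRENZEN, dig.lab = 5)) %>%
  group_by(Gruppen, .drop = FALSE) %>%          # keep levels without observations
  summarise(Abweichung = n(),                   # count per group, 0 for empty groups
            BW_Gesamt = sum(BW_Gesamt))         # sum of an empty group is 0

print(result)
```

Because n() of an empty group is 0 and sum() of an empty vector is 0, the empty interval comes out with zeros directly.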
