R Show empty groups with cut()
I have a set of data:
Abweichung BW_Gesamt
76 236 1137747
77 2000 1149019
78 2000 1227972
79 2331 1346480
80 4000 2226810
81 5272 2874114
82 8585 4418070
83 15307 5389585
Now I want to group them. The difficulty is that the breaks are flexible: I enter the MIN/MAX of the x-axis and the number of groups, so the data is cut into groups that are MYSCHRTW wide:
bins <- 4 # Number of groups
MYMIN <- 0
MYMAX <- 20000
MYSCHRTW <- (MYMAX - MYMIN) %/% bins # Width of one group: 5000
GRENZEN <- seq(from = MYMIN, by = MYSCHRTW, length.out = bins)
GRENZEN <- c(GRENZEN, MYMAX + 1) # Breaks: 0 5000 10000 15000 20001
I use the cut function:
setDT(mydata)[ , Gruppen := cut(mydata$Abweichung,breaks=GRENZEN,dig.lab = 5)]
The problem is that one group is missing: because it is empty, it is not displayed. Plotting the data without that group can bias the result. So how can I add the group (10000,15000], with Abweichung and BW_Gesamt set to 0?
Abweichung BW_Gesamt Gruppen
1: 236 1137747 (0,5000]
2: 2000 1149019 (0,5000]
3: 2000 1227972 (0,5000]
4: 2331 1346480 (0,5000]
5: 4000 2226810 (0,5000]
6: 5272 2874114 (5000,10000]
7: 8585 4418070 (5000,10000]
8: 15307 5389585 (15000,20001]
OK, I don't know if it's efficient, but there is a way:
library(data.table)
The data you work on:
mydata <- data.table(Abweichung = c(236,2000,2000,2331,4000,5272,8585,15307),
BW_Gesamt = c(1137747,1149019,1227972,1346480,2226810,2874114,4418070,5389585))
> mydata
Abweichung BW_Gesamt
1: 236 1137747
2: 2000 1149019
3: 2000 1227972
4: 2331 1346480
5: 4000 2226810
6: 5272 2874114
7: 8585 4418070
8: 15307 5389585
First, create a data.table that contains all the groups from cut():
groups_cut <- data.table(Gruppen = levels(cut(mydata[, Abweichung],breaks=GRENZEN,dig.lab = 5)))
> groups_cut
Gruppen
1: (0,5000]
2: (5000,10000]
3: (10000,15000]
4: (15000,20001]
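This step works because cut() returns a factor, and a factor keeps every interval as a level even when no value falls into it. A minimal base R sketch of that behaviour, using the data and breaks from the question (the variable name counts is only illustrative):

```r
# Values and breaks taken from the question
Abweichung <- c(236, 2000, 2000, 2331, 4000, 5272, 8585, 15307)
GRENZEN <- c(0, 5000, 10000, 15000, 20001)

# cut() returns a factor whose levels cover every interval
Gruppen <- cut(Abweichung, breaks = GRENZEN, dig.lab = 5)

# All four intervals are levels, including the one with no observations
print(levels(Gruppen))

# table() counts per level, so the empty interval is reported with a count of 0
counts <- table(Gruppen)
print(counts)
```

Grouping operations such as `[, .N, by = Gruppen]` only return the groups that actually occur in the data, which is why the empty interval disappears there.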
Then create a second data.table in which you count the number of occurrences by the variable Gruppen:
mydata <- mydata[ , Gruppen := cut(mydata[, Abweichung],breaks=GRENZEN,dig.lab = 5)][, .N, by = Gruppen]
Gruppen N
1: (0,5000] 5
2: (5000,10000] 2
3: (15000,20001] 1
Now you can merge the two data.tables:
merge_dt<- mydata[groups_cut, on = "Gruppen"]
> merge_dt
Gruppen N
1: (0,5000] 5
2: (5000,10000] 2
3: (10000,15000] NA
4: (15000,20001] 1
If you don't want to keep the NA value, you can replace it with 0 right after the merge:
merge_dt <- mydata[groups_cut, on = "Gruppen"][, N := replace(N, is.na(N), 0)]
> merge_dt
Gruppen N
1: (0,5000] 5
2: (5000,10000] 2
3: (10000,15000] 0
4: (15000,20001] 1
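The count and the merge can probably be combined into a single join. The following is a sketch rather than a tested replacement: it assumes a reasonably recent data.table, and the names alle_gruppen and result are made up for illustration. It joins the table of all levels against the data and aggregates once per level with by = .EACHI, so that a level without matching rows gets .N equal to 0:

```r
library(data.table)

# Data and breaks taken from the question
mydata <- data.table(
  Abweichung = c(236, 2000, 2000, 2331, 4000, 5272, 8585, 15307),
  BW_Gesamt  = c(1137747, 1149019, 1227972, 1346480, 2226810, 2874114, 4418070, 5389585)
)
GRENZEN <- c(0, 5000, 10000, 15000, 20001)

# Add the group as a factor column
mydata[, Gruppen := cut(Abweichung, breaks = GRENZEN, dig.lab = 5)]

# One row per level, including the empty one (hypothetical helper table)
alle_gruppen <- data.table(
  Gruppen = factor(levels(mydata$Gruppen), levels = levels(mydata$Gruppen))
)

# Join on Gruppen and aggregate once per row of alle_gruppen;
# na.rm = TRUE makes the sum 0 for the level without matching rows
result <- mydata[alle_gruppen, on = "Gruppen",
                 .(N = .N, BW_Gesamt = sum(BW_Gesamt, na.rm = TRUE)),
                 by = .EACHI]
print(result)
```

Compared with the two-step version, this avoids the intermediate NA and also returns the sum of BW_Gesamt per group.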
I guess I found an answer myself. Continuing from my initial post at:
setDT(mydata)[ , Gruppen := cut(mydata$Abweichung,breaks=GRENZEN,dig.lab = 5)]
> print(mydata)
Abweichung BW_Gesamt Gruppen
1: 236 1137747 (0,5000]
2: 2000 1149019 (0,5000]
3: 2000 1227972 (0,5000]
4: 2331 1346480 (0,5000]
5: 4000 2226810 (0,5000]
6: 5272 2874114 (5000,10000]
7: 8585 4418070 (5000,10000]
8: 15307 5389585 (15000,20000]
> class(mydata$Abweichung)
[1] "numeric"
> class(mydata$BW_Gesamt)
[1] "numeric"
library(dplyr)
mydata <- levels(mydata$Gruppen) %>% # get distinct levels of the Gruppen variable
  data.frame(Gruppen = .) %>% # create a data frame
  left_join(mydata %>% # join with the per-group summary
              group_by(Gruppen) %>% # for each value that exists
              summarise(Abweichung = n(), BW_Gesamt = sum(BW_Gesamt)), # count occurrences and sum BW_Gesamt
            by = "Gruppen") %>%
  mutate(Abweichung = coalesce(Abweichung, 0L)) %>% # replace NAs with 0s
  mutate(BW_Gesamt = coalesce(as.integer(BW_Gesamt), 0L))
> class(mydata$Abweichung)
[1] "integer"
> class(mydata$BW_Gesamt)
[1] "integer"
> print(mydata)
Gruppen Abweichung BW_Gesamt
1 (0,5000] 5 7088028
2 (5000,10000] 2 7292184
3 (10000,15000] 0 0
4 (15000,20000] 1 5389585
The two mutate calls differ because I found that Abweichung becomes an integer after summarise, while BW_Gesamt remains numeric, so BW_Gesamt is converted with as.integer() before coalesce().
I don't know how efficient this method is. I found it here: LINK. Thanks to AntoniosK.
Maybe someone has an idea how it could be optimized. In my opinion, it has the advantage that the per-group result can be chosen freely, so I can show the sum of BW_Gesamt and the number of occurrences of Abweichung at the same time.
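One possible simplification, offered as a sketch and not as the method from the linked post: group_by() has a .drop argument, and setting it to FALSE keeps empty factor levels as groups. This assumes dplyr 1.0.0 or later (for the .groups argument); the name result is illustrative.

```r
library(dplyr)

# Data and breaks taken from the question
mydata <- data.frame(
  Abweichung = c(236, 2000, 2000, 2331, 4000, 5272, 8585, 15307),
  BW_Gesamt  = c(1137747, 1149019, 1227972, 1346480, 2226810, 2874114, 4418070, 5389585)
)
GRENZEN <- c(0, 5000, 10000, 15000, 20001)

result <- mydata %>%
  mutate(Gruppen = cut(Abweichung, breaks = GRENZEN, dig.lab = 5)) %>%
  group_by(Gruppen, .drop = FALSE) %>%      # keep factor levels that have no rows
  summarise(Abweichung = n(),               # number of rows per group, 0 if empty
            BW_Gesamt = sum(BW_Gesamt),     # sum over zero rows is 0
            .groups = "drop")
print(result)
```

Because the empty group is summarised like any other, no join and no coalesce() are needed, and BW_Gesamt can stay numeric.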