R: Show empty groups with cut()
I have a data set:
Abweichung BW_Gesamt
76 236 1137747
77 2000 1149019
78 2000 1227972
79 2331 1346480
80 4000 2226810
81 5272 2874114
82 8585 4418070
83 15307 5389585
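Now I want to group them. The difficulty is that the breaks have to be flexible: I enter the MIN/MAX of the x-axis and the number of groups, and the data is then cut into groups that are each `MYSCHRTW` wide: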
bins <- 4 # Number of groups
MYMIN <- 0
MYMAX <- 20000
MYSCHRTW <- (MYMAX - MYMIN) %/% bins # Width of one group: 5000
GRENZEN <- seq(from = MYMIN, by = MYSCHRTW, length.out = bins)
GRENZEN <- c(GRENZEN, MYMAX + 1) # Breaks: 0 5000 10000 15000 20001
I use the cut() function:
setDT(mydata)[ , Gruppen := cut(mydata$Abweichung,breaks=GRENZEN,dig.lab = 5)]
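The problem is that one group is missing: it is empty, so it is not shown. Plotting the data without that group distorts the result. So how can I add the group (10000,15000] with Abweichung and BW_Gesamt set to 0?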
Abweichung BW_Gesamt Gruppen
1: 236 1137747 (0,5000]
2: 2000 1149019 (0,5000]
3: 2000 1227972 (0,5000]
4: 2331 1346480 (0,5000]
5: 4000 2226810 (0,5000]
6: 5272 2874114 (5000,10000]
7: 8585 4418070 (5000,10000]
8: 15307 5389585 (15000,20001]
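For context, here is a minimal base R sketch (an addition, not part of the original post) suggesting that the empty interval is not lost by cut() itself: the returned factor still carries all four levels, and the empty one only disappears when the rows are grouped. The data and breaks are copied from the question.

```r
# Breaks and values as given in the question
GRENZEN <- c(0, 5000, 10000, 15000, 20001)
Abweichung <- c(236, 2000, 2000, 2331, 4000, 5272, 8585, 15307)

Gruppen <- cut(Abweichung, breaks = GRENZEN, dig.lab = 5)

# The factor keeps every interval as a level, including the empty one
print(levels(Gruppen))

# table() counts per level, so the empty interval shows up with a count of 0
counts <- table(Gruppen)
print(counts)
```

Since the levels are still there, the remaining task is only to carry them through the aggregation step.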
OK, I don't know whether it is efficient, but here is one way:
library(data.table)
Your data:
mydata <- data.table(Abweichung = c(236, 2000, 2000, 2331, 4000, 5272, 8585, 15307),
                     BW_Gesamt = c(1137747, 1149019, 1227972, 1346480, 2226810, 2874114, 4418070, 5389585))
> mydata
Abweichung BW_Gesamt
1: 236 1137747
2: 2000 1149019
3: 2000 1227972
4: 2331 1346480
5: 4000 2226810
6: 5272 2874114
7: 8585 4418070
8: 15307 5389585
First, create a data.table that contains all the groups produced by cut():
groups_cut <- data.table(Gruppen = levels(cut(mydata[, Abweichung],breaks=GRENZEN,dig.lab = 5)))
> groups_cut
Gruppen
1: (0,5000]
2: (5000,10000]
3: (10000,15000]
4: (15000,20001]
Then build a second data.table in which you count the number of occurrences by the variable Gruppen:
mydata <- mydata[ , Gruppen := cut(mydata[, Abweichung],breaks=GRENZEN,dig.lab = 5)][, .N, by = Gruppen]
Gruppen N
1: (0,5000] 5
2: (5000,10000] 2
3: (15000,20001] 1
Now you can merge the two data.tables:
merge_dt<- mydata[groups_cut, on = "Gruppen"]
> merge_dt
Gruppen N
1: (0,5000] 5
2: (5000,10000] 2
3: (10000,15000] NA
4: (15000,20001] 1
If you don't want to keep the NA values, you can replace them right after the merge:
merge_dt <- mydata[groups_cut, on = "Gruppen"][, N := replace(N, is.na(N), 0)]
> merge_dt
Gruppen N
1: (0,5000] 5
2: (5000,10000] 2
3: (10000,15000] 0
4: (15000,20001] 1
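The same join idea can probably be extended to carry the sum of BW_Gesamt along with the count. The following is only a sketch under that assumption; the names `all_groups`, `agg` and `result` are made up for illustration and do not appear in the answer above.

```r
library(data.table)

GRENZEN <- c(0, 5000, 10000, 15000, 20001)
mydata <- data.table(
  Abweichung = c(236, 2000, 2000, 2331, 4000, 5272, 8585, 15307),
  BW_Gesamt  = c(1137747, 1149019, 1227972, 1346480, 2226810, 2874114, 4418070, 5389585)
)
mydata[, Gruppen := cut(Abweichung, breaks = GRENZEN, dig.lab = 5)]

# One row per factor level, including levels without observations
lv <- levels(mydata$Gruppen)
all_groups <- data.table(Gruppen = factor(lv, levels = lv))

# Aggregate the groups that occur, then join onto the full set of levels
agg <- mydata[, .(N = .N, BW_Gesamt = sum(BW_Gesamt)), by = Gruppen]
result <- agg[all_groups, on = "Gruppen"]

# Groups without observations come back as NA after the join; set them to 0
result[is.na(N), `:=`(N = 0L, BW_Gesamt = 0)]
print(result)
```

Keeping Gruppen as a factor with the original level order in `all_groups` means the result stays in interval order, which may help when plotting.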
I think I found an answer myself. Continuing from my first post:
setDT(mydata)[ , Gruppen := cut(mydata$Abweichung,breaks=GRENZEN,dig.lab = 5)]
> print(mydata)
Abweichung BW_Gesamt Gruppen
1: 236 1137747 (0,5000]
2: 2000 1149019 (0,5000]
3: 2000 1227972 (0,5000]
4: 2331 1346480 (0,5000]
5: 4000 2226810 (0,5000]
6: 5272 2874114 (5000,10000]
7: 8585 4418070 (5000,10000]
8: 15307 5389585 (15000,20000]
> class(mydata$Abweichung)
[1] "numeric"
> class(mydata$BW_Gesamt)
[1] "numeric"
library(dplyr)
mydata <- levels(mydata$Gruppen) %>% # get distinct levels of the Gruppen variable
  data.frame(Gruppen = .) %>% # create a data frame
  left_join(mydata %>% # join with
              group_by(Gruppen) %>% # for each value that exists
              summarise(Abweichung = n(), BW_Gesamt = sum(BW_Gesamt)), by = "Gruppen") %>% # get occurrence of Abweichung and sum of BW_Gesamt just for fun
  mutate(Abweichung = coalesce(Abweichung, 0L)) %>% # replace NAs with 0s
  mutate(BW_Gesamt = coalesce(as.integer(BW_Gesamt), 0L))
> class(mydata$Abweichung)
[1] "integer"
> class(mydata$BW_Gesamt)
[1] "integer"
> print(mydata)
Gruppen Abweichung BW_Gesamt
1 (0,5000] 5 7088028
2 (5000,10000] 2 7292184
3 (10000,15000] 0 0
4 (15000,20000] 1 5389585
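The mutate() for Abweichung and the mutate() for BW_Gesamt differ because I found that Abweichung had become an integer (it now holds the result of n()), while BW_Gesamt was still numeric, so it is converted with as.integer() before coalescing with 0L.
I don't know how efficient this approach is. I found it here: LINK. Thanks to AntoniosK.
Maybe someone has an idea how to optimize it. I think its advantage is that the per-group results can be changed, so I can show the sum of BW_Gesamt and the number of occurrences of Abweichung at the same time.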
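One possible simplification, offered as a sketch and not as the method from the linked post: assuming dplyr 0.8.0 or later, group_by() accepts `.drop = FALSE`, which keeps factor levels that have no rows. That may make the separate join and the coalesce() calls unnecessary. The variable name `result` is chosen here for illustration.

```r
library(dplyr)

GRENZEN <- c(0, 5000, 10000, 15000, 20001)
mydata <- data.frame(
  Abweichung = c(236, 2000, 2000, 2331, 4000, 5272, 8585, 15307),
  BW_Gesamt  = c(1137747, 1149019, 1227972, 1346480, 2226810, 2874114, 4418070, 5389585)
)

result <- mydata %>%
  mutate(Gruppen = cut(Abweichung, breaks = GRENZEN, dig.lab = 5)) %>%
  group_by(Gruppen, .drop = FALSE) %>% # keep levels that have no rows
  summarise(Abweichung = n(), # number of rows per group
            BW_Gesamt = sum(BW_Gesamt)) %>% # sum per group, 0 for an empty group
  ungroup()

print(result)
```

With this variant BW_Gesamt stays numeric, so there is no conversion to integer and no risk of overflow for large sums.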