[英]Randomly assign sample into groups in R
我有一個大數據集,其中包含來自不同城市的每個人的一些人口統計信息。 我想創建一個變量(例如類),將城市內同一年齡組的個人分配到大約 20(~15-25)人的組中。 這是一個 R 代碼,用於生成我的數據示例:
set.seed(10)
ID = seq(1:10000)
df <- as.data.frame(ID)
df$City <- cut(runif(10000, 0,100),breaks = c(0,7,20,35,47,55,61,74,85,91,100),include.lowest = T,right = F, labels = c("City 1","City 2","City 3","City 4","City 5","City 6","City 7","City 8","City 9","City 10"))
df$Age_Group <- cut(runif(10000, 0,100),breaks = c(0,10,20,30,40,50,60,70,80,90,101),include.lowest = T,right = F, labels = c("0-9","10-19","20-29","30-39","40-49","50-59","60-69","70-79","80-89","90+"))
table(df$Age_Group,df$City)
我希望df$class
將相似年齡組和城市的個人分組。 class 的值需要在所有年齡組和城市之后繼續。 我怎樣才能做到這一點?
謝謝
使用toString
。
df$class <- factor(apply(df[c("City", "Age_Group")], 1, toString))
levels(df$class)
# [1] "City 1, 0-9" "City 1, 10-19" "City 1, 20-29" "City 1, 30-39"
# [5] "City 1, 40-49" "City 1, 50-59" "City 1, 60-69" "City 1, 70-79"
# [9] "City 1, 80-89" "City 1, 90+" "City 10, 0-9" "City 10, 10-19"
# [13] "City 10, 20-29" [...]
要獲得隨機樣本,您可以by
"class"
將數據集拆分為子集,例如s
,並計算當您將nrow(s)/20
(個人)除以 20 時獲得的組數。可能使用ceiling
小數點數,比如x
,然后利用 R 的回收屬性; 使用cbind
將1:ceiling(x)
綁定到s
並讓它循環到nrow(s)
,在那里我們可以安全地suppressWarnings
警告。 當然,我們現在想使用sample
來擾亂順序,只需要列[,2]
。 最后使用rownames
do.call(rbind(.))
來取消拆分數據集,如果需要,可以刪除行名。
set.seed(1) ## for sake of reproducibility
df <- `rownames<-`(do.call(rbind, by(df, df$class, function(s)
transform(s, SAMP=suppressWarnings(
sample(cbind(s$class, SAMP=1:ceiling(nrow(s)/20))[,2])
)))), NULL)
結果:
產生"SAMP"
列,每個"class"
具有大約 20 個成員的大小大致相等的組。
df[60:70, ] ##example rows
# ID City Age_Group class SAMP
# 60 8766 City 01 0-9 City 01, 0-9 4
# 61 8775 City 01 0-9 City 01, 0-9 1
# 62 9021 City 01 0-9 City 01, 0-9 3
# 63 9041 City 01 0-9 City 01, 0-9 3
# 64 9482 City 01 0-9 City 01, 0-9 1
# 65 9622 City 01 0-9 City 01, 0-9 1
# 66 47 City 01 10-19 City 01, 10-19 4
# 67 698 City 01 10-19 City 01, 10-19 3
# 68 833 City 01 10-19 City 01, 10-19 1
# 69 1166 City 01 10-19 City 01, 10-19 1
# 70 1221 City 01 10-19 City 01, 10-19 2
使用其 SAMPles 檢查類的前十個表:
by(df$SAMP, df$class, table)[1:10]
# $`City 01, 0-9`
#
# 1 2 3 4
# 17 16 16 16
#
# $`City 01, 10-19`
#
# 1 2 3 4
# 18 17 17 17
#
# $`City 01, 20-29`
#
# 1 2 3 4
# 18 18 17 17
#
# $`City 01, 30-39`
#
# 1 2 3 4
# 19 19 19 19
#
# $`City 01, 40-49`
#
# 1 2 3 4
# 19 19 19 18
#
# $`City 01, 50-59`
#
# 1 2 3 4 5
# 18 17 17 17 17
#
# $`City 01, 60-69`
#
# 1 2 3 4
# 16 16 16 16
#
# $`City 01, 70-79`
#
# 1 2 3 4
# 19 19 19 19
#
# $`City 01, 80-89`
#
# 1 2 3 4
# 20 19 19 19
#
# $`City 01, 90+`
#
# 1 2 3 4
# 18 17 17 17
如果您想要 class 編號而不是全部編號,只需將"class"
(作為數字)和"SAMP"
paste
在一起。
df <- transform(df, SAMP2=paste(as.numeric(class), SAMP, sep="."))
head(df)
# ID City Age_Group class SAMP SAMP2
# 1 193 City 01 0-9 City 01, 0-9 3 1.3
# 2 480 City 01 0-9 City 01, 0-9 1 1.1
# 3 742 City 01 0-9 City 01, 0-9 2 1.2
# 4 757 City 01 0-9 City 01, 0-9 1 1.1
# 5 811 City 01 0-9 City 01, 0-9 3 1.3
# 6 870 City 01 0-9 City 01, 0-9 3 1.3
caret
package 可以幫助您解決這個問題。 考慮到輸入的不平衡性質,它會嘗試創建 n 個分區,同時尊重Age
和City
等類別,它不會是完美的。 但是你可以選擇分區的數量(又名折疊),看看什么適合你的需要,我選擇了 5 個。
require(caret)
#> Loading required package: caret
#> Loading required package: lattice
#> Loading required package: ggplot2
set.seed(10)
ID = seq(1:10000)
df <- as.data.frame(ID)
df$City <- cut(runif(10000, 0,100),breaks = c(0,7,20,35,47,55,61,74,85,91,100),include.lowest = T,right = F, labels = c("City 1","City 2","City 3","City 4","City 5","City 6","City 7","City 8","City 9","City 10"))
df$Age_Group <- cut(runif(10000, 0,100),breaks = c(0,10,20,30,40,50,60,70,80,90,101),include.lowest = T,right = F, labels = c("0-9","10-19","20-29","30-39","40-49","50-59","60-69","70-79","80-89","90+"))
# table(df$Age_Group, df$City)
df$class <- caret::createFolds(df$Age_Group,
5,
FALSE)
table(df$class, df$City, df$Age_Group)
#> , , = 0-9
#>
#>
#> City 1 City 2 City 3 City 4 City 5 City 6 City 7 City 8 City 9 City 10
#> 1 18 27 28 29 15 8 22 21 9 21
#> 2 16 29 31 27 9 10 19 23 12 22
#> 3 12 20 26 26 20 11 30 22 12 18
#> 4 9 27 24 28 13 12 24 31 12 17
#> 5 10 22 36 31 13 13 23 24 11 15
#>
#> , , = 10-19
#>
#>
#> City 1 City 2 City 3 City 4 City 5 City 6 City 7 City 8 City 9 City 10
#> 1 13 22 13 22 11 9 38 18 22 23
#> 2 12 23 34 21 13 7 26 22 16 16
#> 3 14 25 30 25 13 7 30 23 11 12
#> 4 13 29 31 19 22 17 23 16 9 11
#> 5 17 22 24 23 18 20 22 15 9 20
#>
#> , , = 20-29
#>
#>
#> City 1 City 2 City 3 City 4 City 5 City 6 City 7 City 8 City 9 City 10
#> 1 14 28 31 24 12 10 35 22 12 14
#> 2 9 32 22 29 15 9 30 19 18 19
#> 3 18 35 25 17 14 13 22 18 19 21
#> 4 15 26 33 25 11 15 37 20 1 19
#> 5 14 20 31 32 12 14 23 16 18 21
#>
#> , , = 30-39
#>
#>
#> City 1 City 2 City 3 City 4 City 5 City 6 City 7 City 8 City 9 City 10
#> 1 13 28 29 22 24 14 24 19 18 21
#> 2 15 28 31 32 19 14 21 25 16 12
#> 3 17 30 28 22 20 9 22 29 14 21
#> 4 18 26 33 23 10 16 23 24 13 26
#> 5 13 26 40 24 12 8 25 21 20 23
#>
#> , , = 40-49
#>
#>
#> City 1 City 2 City 3 City 4 City 5 City 6 City 7 City 8 City 9 City 10
#> 1 16 26 41 16 19 13 19 18 16 22
#> 2 18 23 36 32 8 12 28 15 16 18
#> 3 19 27 29 23 11 16 33 13 15 21
#> 4 13 21 30 29 18 18 26 19 9 23
#> 5 9 34 27 27 17 9 27 22 11 23
#>
#> , , = 50-59
#>
#>
#> City 1 City 2 City 3 City 4 City 5 City 6 City 7 City 8 City 9 City 10
#> 1 21 28 28 21 15 10 25 26 21 8
#> 2 12 17 24 25 20 20 25 32 14 13
#> 3 19 27 35 30 10 8 19 24 13 17
#> 4 19 23 30 23 19 11 19 25 16 18
#> 5 15 37 38 18 10 15 23 25 9 13
#>
#> , , = 60-69
#>
#>
#> City 1 City 2 City 3 City 4 City 5 City 6 City 7 City 8 City 9 City 10
#> 1 12 29 31 25 14 15 12 27 11 20
#> 2 12 22 29 25 18 14 22 20 11 24
#> 3 11 27 30 21 15 16 22 23 15 16
#> 4 17 21 32 20 12 12 24 28 11 19
#> 5 12 27 37 31 11 11 17 16 17 18
#>
#> , , = 70-79
#>
#>
#> City 1 City 2 City 3 City 4 City 5 City 6 City 7 City 8 City 9 City 10
#> 1 10 23 27 36 13 7 29 20 13 17
#> 2 25 19 27 27 18 8 25 17 10 20
#> 3 12 17 27 26 13 5 34 24 14 23
#> 4 12 28 34 22 15 8 28 21 14 13
#> 5 17 30 40 23 13 11 21 17 7 16
#>
#> , , = 80-89
#>
#>
#> City 1 City 2 City 3 City 4 City 5 City 6 City 7 City 8 City 9 City 10
#> 1 10 27 26 34 17 16 23 19 8 16
#> 2 17 19 33 16 19 19 16 31 12 14
#> 3 14 24 27 23 14 10 25 23 12 23
#> 4 12 25 30 33 14 16 19 14 12 20
#> 5 24 24 25 26 20 6 18 20 13 20
#>
#> , , = 90+
#>
#>
#> City 1 City 2 City 3 City 4 City 5 City 6 City 7 City 8 City 9 City 10
#> 1 16 21 30 25 20 15 31 23 10 11
#> 2 15 25 34 28 16 13 25 19 10 17
#> 3 12 23 30 26 19 14 24 23 13 18
#> 4 13 30 30 24 15 10 23 25 14 18
#> 5 13 16 24 24 23 17 30 23 18 15
由代表 package (v0.3.0) 於 2020 年 5 月 8 日創建
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.