
Subset data by randomly selecting rows based on two columns

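I have a large data.frame, and I want to create a new data.frame in which rows are selected at random based on two columns.

There are 90 unique elkIDs, each with roughly 48 rows per FixDate. I would like to create a new data.frame that contains all 90 unique elkIDs, with 4 randomly selected rows for each FixDate.

The data look like this: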

> head(df)
  elkID        X       Y       Fix.Date.Time    FixDate
1   245 550345.1 4826676 2010-02-24 10:00:58 2010-02-24
2   245 550217.9 4826519 2010-02-24 10:30:47 2010-02-24
3   245 550066.3 4826478 2010-02-24 11:00:41 2010-02-24
4   245 549912.6 4826419 2010-02-24 11:30:48 2010-02-24
5   245 549977.3 4826438 2010-02-24 12:00:55 2010-02-24
6   245 549795.1 4826294 2010-02-24 12:30:29 2010-02-24

I would like it to look like this (4 rows per unique elkID per FixDate):

> df2
  elkID        X       Y       Fix.Date.Time    FixDate
1   245 550345.1 4826676 2010-02-24 10:00:58 2010-02-24
2   245 550217.9 4826519 2010-02-24 10:30:47 2010-02-24
3   245 550066.3 4826478 2010-02-24 11:00:41 2010-02-24
4   245 549912.6 4826419 2010-02-24 11:30:48 2010-02-24
5   245 549977.3 4826438 2010-02-24 12:00:55 2010-02-25
6   245 549795.1 4826294 2010-02-24 12:30:29 2010-02-25

I am using RStudio 0.99.467 and R 3.2.1.

If you want to loop through them, you could try the following:

# initialize a new dataframe to store new data
newdf = NULL    

# extract unique elk IDs
IDs = unique(df$elkID)

# loop over each elk ID (i loop), then over each unique date
# for that elk (j loop)
for(i in seq_along(IDs)){
  data1 = df[df$elkID == IDs[i],]
  dates = unique(data1$FixDate)
  for(j in seq_along(dates)){
    data2 = data1[data1$FixDate == dates[j],]
    # select 4 rows at random for this ID and date
    # (or all rows, if the group has fewer than 4)
    data2 = data2[sample.int(nrow(data2), min(4, nrow(data2))),]
    newdf = rbind(newdf,data2)
  }
}

head(newdf)
tail(newdf)

Is this what you are after?
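The same idea can probably be written without growing `newdf` inside a loop. The following is only a sketch in base R: the `toy` data frame and the `sample_rows()` helper are made up for illustration (they are not from the question), and the `min()` guard is my own addition so that a group with fewer than 4 rows does not make the sampling fail.

```r
set.seed(1)  # only so the toy example is reproducible

# Hypothetical toy data: 2 elk, 2 dates, 6 fixes per elk and date
toy <- data.frame(
  elkID   = rep(c(245, 246), each = 12),
  FixDate = rep(rep(c("2010-02-24", "2010-02-25"), each = 6), times = 2),
  X       = runif(24, 549000, 551000),
  Y       = runif(24, 4826000, 4827000)
)

# Hypothetical helper: return up to n randomly chosen rows of d
sample_rows <- function(d, n = 4) {
  d[sample.int(nrow(d), min(n, nrow(d))), , drop = FALSE]
}

# One data frame per elkID/FixDate combination that actually occurs
groups <- split(toy, list(toy$elkID, toy$FixDate), drop = TRUE)

# Sample within each group, then bind the pieces together once
newdf <- do.call(rbind, lapply(groups, sample_rows))
rownames(newdf) <- NULL

# Rows kept per elkID (rows) and FixDate (columns)
table(newdf$elkID, newdf$FixDate)
```

With the real data you would replace `toy` with `df`. Binding once at the end should scale better than calling `rbind()` on every iteration, although I have not timed it on data of your size.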

For large data I would suggest using the data.table package:

library(data.table)
setDT(df)
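# 4 random rows per (elkID, FixDate) group, groups in order of first appearance
df[, .SD[sample(.N, 4)], by = .(elkID, FixDate)]
# or: same selection, with the result also sorted and keyed by elkID, FixDate
df[, .SD[sample(.N, 4)], keyby = .(elkID, FixDate)]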
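To see what the grouped call returns, here is a small self-contained sketch. The `toy` table, `n_keep`, and the `min(n_keep, .N)` guard are my own assumptions for illustration and are not part of the answer above. The guard is there because `sample(.N, 4)` stops with an error for any group that has fewer than 4 rows.

```r
library(data.table)

set.seed(42)  # only so the toy example is reproducible

# Hypothetical toy data: 3 elk, 2 dates, 10 fixes per elk and date
toy <- data.table(
  elkID   = rep(c(245, 246, 247), each = 20),
  FixDate = rep(rep(as.Date(c("2010-02-24", "2010-02-25")), each = 10), times = 3),
  X       = runif(60, 549000, 551000),
  Y       = runif(60, 4826000, 4827000)
)

n_keep <- 4L  # rows to keep per elkID/FixDate group

# min(n_keep, .N) keeps sample() from failing on groups with fewer than n_keep rows
sampled <- toy[, .SD[sample(.N, min(n_keep, .N))], by = .(elkID, FixDate)]

# Check: number of rows kept in each elkID/FixDate group
counts <- sampled[, .N, by = .(elkID, FixDate)]
print(counts)
```

On the real data, `counts` gives a quick way to confirm that every elkID/FixDate pair ended up with 4 rows, or fewer where a day had fewer fixes.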
