Sample data.table rows with different conditions

Question

I have a data.table with multiple columns. One of these columns currently works as a 'key' (keyb in the example). Another column (say A) may or may not have data in it. I would like to supply a vector of keys and randomly sample two rows per key, if that key appears in the vector: one row that contains data in A and one that does not.

MRE:

#data.table
trys <- structure(list(keyb = c("x", "x", "x", "x", "x", "y", "y", "y", 
"y", "y"), A = c("1", "", "1", "", "", "1", "", "", "1", "")), .Names = c("keyb", 
"A"), row.names = c(NA, -10L), class = c("data.table", "data.frame"
))
setkey(trys,keyb)

#list with keys
list_try <- structure(list(a = "x", b = c("r", "y","x")), .Names = c("a", "b"))

I could, for instance, subset the data.table based on the elements that appear in list_try:

trys[keyb %in% list_try[[2]]]

My original (and probably inefficient) idea was to chain a sample of two rows per key, once where the A column has data and once where it has none, and then merge the results. But it does not work:

#here I was trying to sample rows based on whether A has data or not
#here for rows where A has no data
trys[keyb %in% list_try[[2]]][nchar(A)==0][sample(.N, 2), ,by = keyb]
#here for rows where A has data
trys[keyb %in% list_try[[2]]][nchar(A)==1][sample(.N, 2), ,by = keyb]

In this case, my expected output would be two data.tables (one for a and one for b in list_try), with two rows per key that appears in the data: the data.table from a would have two rows (one with and one without data in A), and the one from b would have four rows (two with and two without data in A).

Please let me know if I can make this post any clearer.

Answer 1

You could add A to the by statement too, converting it to a logical vector with A != "". Combine this with a binary join (adding nomatch = 0L to drop non-matching keys), sample from the row index .I within those two grouping variables, and then subset the original data set by the sampled indices.

For a single subset case:

trys[trys[list_try[[2]], nomatch = 0L, sample(.I, 1L), by = .(keyb, A != "")]$V1]
#    keyb A
# 1:    y 1
# 2:    y  
# 3:    x 1
# 4:    x
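
To see what the one-liner does, its steps can be unpacked as in the sketch below. It rebuilds the question's toy data with data.table() rather than structure(), and the names keys, joined, idx, has_A, row and res are introduced here purely for illustration.

```r
library(data.table)

# Same toy data as in the question, rebuilt with data.table()
trys <- data.table(
  keyb = rep(c("x", "y"), each = 5L),
  A    = c("1", "", "1", "", "", "1", "", "", "1", "")
)
setkey(trys, keyb)

keys <- c("r", "y", "x")  # "r" does not occur in trys

# Step 1: keyed (binary) join; nomatch = 0L drops keys with no match
joined <- trys[keys, nomatch = 0L]

# Step 2: within each (keyb, has-data) group, sample one row number.
# .I holds row positions in trys, not in the joined subset.
idx <- trys[keys, nomatch = 0L,
            .(row = sample(.I, 1L)),
            by = .(keyb, has_A = A != "")]

# Step 3: use the sampled row numbers to subset the original table
res <- trys[idx$row]
```

Here idx has one row per combination of key and has_A, so res ends up with one row with data and one without for every key that was matched.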

For a more general case, when you want to create separate data sets according to a list of keys, you could easily embed this into lapply:

lapply(list_try, 
       function(x) trys[trys[x, nomatch = 0L, sample(.I, 1L), by = .(keyb, A != "")]$V1]) 
# $a
#    keyb A
# 1:    x 1
# 2:    x  
# 
# $b
#    keyb A
# 1:    y 1
# 2:    y  
# 3:    x 1
# 4:    x
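
One caveat, offered as a hedged aside: base R's sample(x, 1L) treats a single number x >= 1 as the range 1:x, so if a (keyb, A != "") group ever contained exactly one row, sample(.I, 1L) could return a row from a different group. A possible variant samples a position and indexes .I with it. The helper name sample_one_each and the extra "z" rows are assumptions made for this sketch, not part of the original answer or of data.table.

```r
library(data.table)

# Toy data from the question plus a hypothetical key "z" that has
# exactly one row with data and one row without
dt <- data.table(
  keyb = c(rep("x", 5L), rep("y", 5L), "z", "z"),
  A    = c("1", "", "1", "", "", "1", "", "", "1", "", "1", "")
)
setkey(dt, keyb)

# Hypothetical helper: one random row with data in A and one without,
# for every key in `keys` that exists in `dt`
sample_one_each <- function(dt, keys) {
  # .I[sample(.N, 1L)] picks a position within the group first,
  # so a single-row group always returns its own row number
  idx <- dt[keys, nomatch = 0L,
            .I[sample(.N, 1L)],
            by = .(keyb, A != "")]$V1
  dt[idx]
}

key_list <- list(a = "x", b = c("r", "y", "z"))
out <- lapply(key_list, function(k) sample_one_each(dt, k))
```

With the question's data every group has at least two rows, so the original sample(.I, 1L) call behaves as intended there; the variant only matters once single-row groups can occur.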

Answer 1: accepted 2016-03-07 18:38:20