How to find closest value match between 2 data frames in R
Problem: I have 2 datasets with no matching identifiers (like ID) and need to find the closest match in df1$time to df2$tstart. df1 (with the time column) has 660,000 rows, with time stamps approximately every 0.00125 s. For whichever row of df1 is the closest match to each df2$tstart, I would like a new column (df1$trial_start) that says "yes" for that row and "no" otherwise.
I've tried findInterval, but it only seems to match in ascending order and doesn't check values in both directions. In the code below, most of the output looks good, but at some indices the value after the returned index is closer to $tstart.
#my actual code:
index_closest <- findInterval(iti_summaries_2183[["24"]]$tstart, poke_1s$time)
poke_1s$trial_start <- ifelse(seq_len(nrow(poke_1s)) %in% index_closest, "yes", "no")
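The behaviour described above follows from findInterval returning the index of the last value that is less than or equal to each query, so it never looks at the next value up. A minimal sketch of one possible workaround is to compare that left neighbour with the one after it and keep whichever is closer. The names `times` and `starts` are hypothetical stand-ins for poke_1s$time and the tstart column, and `times` is assumed to be sorted in ascending order.

```r
# Hypothetical stand-ins: `times` for poke_1s$time (must be sorted ascending),
# `starts` for the tstart values.
times  <- seq(from = 2267.2, to = 2363.75, by = 0.125)
starts <- c(2267.231583, 2289.036355, 2298.046849, 2363.748095)

# Index of the last time <= each start (0 if the start precedes all times)
left <- findInterval(starts, times)

# Candidate neighbours on each side, clamped to valid indices
lo <- pmax(left, 1)
hi <- pmin(left + 1, length(times))

# Keep whichever neighbour is closer to each start
index_closest <- ifelse(abs(times[hi] - starts) < abs(starts - times[lo]), hi, lo)

# Flag the matched rows
trial_start <- ifelse(seq_along(times) %in% index_closest, "yes", "no")
```

Because findInterval uses a binary search, this should stay fast on a column with several hundred thousand rows.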
I've also tried which.min, which doesn't work since the lengths of the two vectors don't match. Additionally, I've fought with roll = "nearest" like here, but those functions return values and I'm not sure how to create a new column and assign yes/no.
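On the which.min attempt: the length mismatch can be avoided by applying which.min to one start time at a time, rather than to both vectors at once. The sketch below assumes hypothetical vectors `times` and `starts` in place of the real columns. It scans the whole time vector once per start, which is simple but slower than a binary search on very long data.

```r
# Hypothetical stand-ins for df1$time and df2$tstart
times  <- seq(from = 2267.2, to = 2363.75, by = 0.125)
starts <- c(2267.231583, 2289.036355, 2328.334036)

# For each start, take the index of the time with the smallest absolute difference
index_closest <- vapply(starts,
                        function(s) which.min(abs(times - s)),
                        integer(1))

# Flag the matched rows
trial_start <- ifelse(seq_along(times) %in% index_closest, "yes", "no")
```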
Code to replicate problem:
n <- 773
df1 <- structure(list(initiate = sample(c(0,1), replace=TRUE, size=n),
left = sample(c(0,1), replace=TRUE, size=n),
right = sample(c(0,1), replace=TRUE, size=n),
time = seq(from = 2267.2, to = 2363.75, by = 0.125)))
df1 <- data.frame(df1)
df2 <- structure(list(trial = c(156:162),
control = c(0, 0, 0, 0, 3, 0, 3),
t_start = c(2267.231583, 2289.036355, 2298.046849, 2318.933635, 2328.334036, 2347.870449, 2363.748095),
t_end = c(2268.76760, 2290.83370, 2299.38547, 2320.71400, 2329.93985, 2349.15464, 2365.12455)),
class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -7L), spec = structure(list(
cols = list(trial = structure(list(), class = c("collector_double",
"collector")), control = structure(list(), class = c("collector_double",
"collector")), t_start = structure(list(), class = c("collector_double",
"collector")), t_end = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
If I understand your question correctly:
library(data.table)
setDT(df1)
setDT(df2)
df1[df2,.(initiate,left,right,x.time,trial,control,t_start,t_end,
trial_start=fifelse(x.time>t_start&x.time<t_end,'Y','N')),
on=.(time=t_start),roll='nearest']
   initiate  left right   x.time trial control  t_start    t_end trial_start
      <num> <num> <num>    <num> <int>   <num>    <num>    <num>      <char>
1:        0     0     1 2267.200   156       0 2267.232 2268.768           N
2:        0     0     1 2289.075   157       0 2289.036 2290.834           Y
3:        0     0     1 2298.075   158       0 2298.047 2299.385           Y
4:        1     1     1 2318.950   159       0 2318.934 2320.714           Y
5:        1     1     1 2328.325   160       3 2328.334 2329.940           N
6:        0     0     1 2347.825   161       0 2347.870 2349.155           N
7:        1     1     0 2363.700   162       3 2363.748 2365.125           N
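The join above returns one row per trial. If the goal is instead a yes/no column on df1 itself, as the question asks, a possible variation (a sketch, not part of the original answer) is to run the same rolling join with which = TRUE, which returns the matching row numbers of df1, and then assign the column by reference. The tables below are cut-down stand-ins that reuse the column names from the reproducible example.

```r
library(data.table)

# Cut-down stand-ins for df1 and df2 from the reproducible example
df1 <- data.table(time = seq(from = 2267.2, to = 2363.75, by = 0.125))
df2 <- data.table(t_start = c(2267.231583, 2289.036355, 2298.046849, 2318.933635,
                              2328.334036, 2347.870449, 2363.748095))

# For each row of df2, get the row number of the nearest df1$time
idx <- df1[df2, on = .(time = t_start), roll = "nearest", which = TRUE]

# Default every row to "no", then mark the matched rows as "yes"
df1[, trial_start := "no"]
df1[idx, trial_start := "yes"]
```

Inspecting `df1[trial_start == "yes"]` afterwards should show one flagged row per trial, provided no two trials share the same nearest time stamp.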