
How to find closest value match between 2 data frames in R

Problem: I have 2 datasets with no matching identifiers (like ID) and need to find the closest match in df1$time to df2$tstart. df1 (with the time column) has 660,000 rows with time stamps approximately every 0.00125 s. For whichever row of df1 is the closest match to each df2$tstart, I would like a new column (df1$trial_start) that says "yes", and "no" for every other row.

I've tried findInterval, but it only seems to match in ascending order and doesn't check values in both directions. In the code below, most of the output looks right, but at some indices the value after the returned index is closer to $tstart.

#my actual code: 
index_closest <- findInterval(iti_summaries_2183[["24"]]$tstart, poke_1s$time)
poke_1s$trial_start <- ifelse(seq_len(nrow(poke_1s)) %in% index_closest, "yes", "no")
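
The behavior described above can be checked on a small vector: findInterval returns the index of the last value that does not exceed the target, so the nearest value is either that one or the one right after it. The following is a minimal base R sketch that compares both neighbors; `nearest_index` is a hypothetical helper name, the toy vectors are made up for illustration, and the time vector is assumed to be sorted in ascending order.

```r
# Hypothetical helper: index of the value in `sorted` nearest to each x.
# Assumes `sorted` is in ascending order, as findInterval requires.
nearest_index <- function(x, sorted) {
  lo <- findInterval(x, sorted)            # last value <= x (0 if none)
  lo <- pmax(lo, 1L)                       # clamp targets below the range
  hi <- pmin(lo + 1L, length(sorted))      # following neighbor, clamped
  # Take the following neighbor only when it is strictly closer.
  ifelse(abs(sorted[hi] - x) < abs(x - sorted[lo]), hi, lo)
}

# Toy data (illustrative values only)
time    <- c(0, 1, 2)
targets <- c(0.4, 0.6, -3, 10)

idx <- nearest_index(targets, time)
trial_start <- ifelse(seq_along(time) %in% idx, "yes", "no")
```

On ties the sketch keeps the earlier index; swapping `<` for `<=` would prefer the later one.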

I've also tried which.min, which doesn't work because the vector lengths don't match. Additionally, I've fought with roll = "nearest" like here, but the functions return values and I'm not sure how to create a new column and assign yes/no.
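
For reference, which.min can be made to work by applying it to one target at a time, so the two vectors never need to have the same length. This is only a sketch on made-up values; it scans the whole time vector once per target, which is probably acceptable for a handful of trials but may be slow for many.

```r
# Toy data (illustrative values only)
time    <- seq(0, 1, by = 0.125)
targets <- c(0.05, 0.40, 0.93)

# For each target, find the position of the smallest absolute difference.
idx <- vapply(targets,
              function(t) which.min(abs(time - t)),
              integer(1))

# Flag the matched rows; every other row gets "no".
trial_start <- ifelse(seq_along(time) %in% idx, "yes", "no")
```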

Code to replicate problem:

n <- 773
df1 <- structure(list(initiate = sample(c(0,1), replace=TRUE, size=n), 
                      left = sample(c(0,1), replace=TRUE, size=n), 
                      right = sample(c(0,1), replace=TRUE, size=n), 
                      time = seq(from = 2267.2, to = 2363.75, by = 0.125)))

df1 <- data.frame(df1)
                
df2 <- structure(list(trial = c(156:162), 
                      control = c(0, 0, 0, 0, 3, 0, 3), 
                      t_start = c(2267.231583, 2289.036355, 2298.046849, 2318.933635, 2328.334036, 2347.870449, 2363.748095), 
                      t_end = c(2268.76760, 2290.83370, 2299.38547, 2320.71400, 2329.93985, 2349.15464, 2365.12455)), 
                 class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -7L), spec = structure(list(
    cols = list(trial = structure(list(), class = c("collector_double", 
    "collector")), control = structure(list(), class = c("collector_double", 
    "collector")), t_start = structure(list(), class = c("collector_double", 
    "collector")), t_end = structure(list(), class = c("collector_double", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1L), class = "col_spec"))

If I understand your question correctly:

library(data.table)

setDT(df1)
setDT(df2)

df1[df2,.(initiate,left,right,x.time,trial,control,t_start,t_end,
          trial_start=fifelse(x.time>t_start&x.time<t_end,'Y','N')),
    on=.(time=t_start),roll='nearest']

   initiate  left right   x.time trial control  t_start    t_end trial_start
      <num> <num> <num>    <num> <int>   <num>    <num>    <num>      <char>
1:        0     0     1 2267.200   156       0 2267.232 2268.768           N
2:        0     0     1 2289.075   157       0 2289.036 2290.834           Y
3:        0     0     1 2298.075   158       0 2298.047 2299.385           Y
4:        1     1     1 2318.950   159       0 2318.934 2320.714           Y
5:        1     1     1 2328.325   160       3 2328.334 2329.940           N
6:        0     0     1 2347.825   161       0 2347.870 2349.155           N
7:        1     1     0 2363.700   162       3 2363.748 2365.125           N
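
The join above returns one row per trial. If the goal is instead the yes/no column on df1 that the question asks for, one possible variation is to ask the same rolling join for row numbers with `which = TRUE` and then flag those rows by reference. This is a sketch under that reading of the question, not part of the answer above, and it uses a cut-down version of the example data.

```r
library(data.table)

# Cut-down version of the question's example data
df1 <- data.table(time = seq(from = 2267.2, to = 2363.75, by = 0.125))
df2 <- data.table(t_start = c(2267.231583, 2289.036355, 2298.046849))

# which = TRUE returns, for each row of df2, the row number of the
# nearest df1$time instead of the joined columns.
idx <- df1[df2, on = .(time = t_start), roll = "nearest", which = TRUE]

# Flag the matched rows in df1; every other row stays "no".
df1[, trial_start := "no"]
df1[idx, trial_start := "yes"]

# Inspect only the flagged rows
df1[trial_start == "yes"]
```

If two trials happen to share the same nearest time stamp, that row is flagged once, so the number of "yes" rows can be smaller than the number of trials.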


 