Comparing immediate past values to current value in R, missing data too

I am using RStudio (version 0.99.903) on a PC (Windows 8). I have a question that is really difficult for me. Here is what the data look like:

     Number     Trial       ID  Open date   Enrollment
     420        NCT00091442 9   1/28/2005   0.2
     1476       NCT00301457 26  2/22/2008   1
     10559      NCT01307397 34  7/28/2011   0.6
     6794       NCT00948675 53  5/12/2010   0
     6451       NCT00917384 53  8/17/2010   0.3
     8754       NCT01168973 53  1/19/2011   0.2
     8578       NCT01140347 53  12/30/2011  2.4
     11655      NCT01358877 53  4/2/2012    0.3
     428        NCT00091442 55  9/7/2005    0.1
     112        NCT00065325 62  10/15/2003  0.2
     477        NCT00091442 62  11/11/2005  0.1
     16277      NCT01843374 62  12/16/2013  0.2
     17386      NCT01905657 62  1/8/2014    0.6
     411        NCT00091442 66  1/12/2005   0

What I need to do is compare the enrollment at each date within an ID to the one immediately before it. If there is no earlier date within the ID, then the comparison should not be made. For instance, for ID 26, there would be no comparison. Similarly, for ID 53, there would be no comparison for 5/12/2010, but I would like to compare 8/17/2010 to 5/12/2010, and then 1/19/2011 to 8/17/2010 (but not also to 5/12/2010). The output would ideally be a dot plot of the current value against the prior value (prior on the y axis, current on the x axis). Finally, I would need to generate a column that subtracts the current value from the one just prior...

There are >20,000 data points. I've tried to write a script to look back to the prior value, but I haven't been able to control for ID. Also, I imagine it wouldn't be much different if I looked back one year, two years, five years, etc...?

Any help would be much appreciated.

As for the data processing, I think what you want is the difference in days between consecutive dates. You can achieve this in a few ways. Here, I chose to use shift() from the data.table package. You specify type = "lag" in the function to get the previous value and then handle the subtraction. You do this operation for each ID by specifying by = ID. The first row of each ID has no previous date, so its result is NA, which coalesce() then replaces with 0. I cannot visualize what kind of plot you have in mind. I am happy to help if you can clarify what you meant in your question.

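library(tidyverse)   # for coalesce() (from dplyr)
library(data.table)  # for setDT(), as.IDate() and shift()

# 1) parse the dates, 2) take the gap in days to the previous row within each ID,
# 3) replace the NA on each ID's first row with 0
setDT(mydf)[, Opendate := as.IDate(Opendate, format = "%m/%d/%Y")][,
    out := as.numeric(Opendate - shift(Opendate, type = "lag")), by = ID][,
    out := coalesce(out, 0)]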


#    Number       Trial ID   Opendate Enrollment  out
# 1:    420 NCT00091442  9 2005-01-28        0.2    0
# 2:   1476 NCT00301457 26 2008-02-22        1.0    0
# 3:  10559 NCT01307397 34 2011-07-28        0.6    0
# 4:   6794 NCT00948675 53 2010-05-12        0.0    0
# 5:   6451 NCT00917384 53 2010-08-17        0.3   97
# 6:   8754 NCT01168973 53 2011-01-19        0.2  155
# 7:   8578 NCT01140347 53 2011-12-30        2.4  345
# 8:  11655 NCT01358877 53 2012-04-02        0.3   94
# 9:    428 NCT00091442 55 2005-09-07        0.1    0
#10:    112 NCT00065325 62 2003-10-15        0.2    0
#11:    477 NCT00091442 62 2005-11-11        0.1  758
#12:  16277 NCT01843374 62 2013-12-16        0.2 2957
#13:  17386 NCT01905657 62 2014-01-08        0.6   23
#14:    411 NCT00091442 66 2005-01-12        0.0    0
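
The same shift() idea can be applied to Enrollment itself, which seems closer to what the question describes. The following is only a sketch under some assumptions: the column names prior and diff_prior are made up for illustration, the data is a small inline copy of IDs 26 and 53 from the question, and the plot uses base R's plot() as one possible reading of "dot plot". Rows without an earlier date in the same ID are left as NA rather than set to 0, so they drop out of the comparison.

```r
library(data.table)

# Inline copy of IDs 26 and 53 from the question (dates already in ISO form).
dt <- data.table(
  ID         = c(26L, 53L, 53L, 53L, 53L, 53L),
  Opendate   = as.IDate(c("2008-02-22", "2010-05-12", "2010-08-17",
                          "2011-01-19", "2011-12-30", "2012-04-02")),
  Enrollment = c(1, 0, 0.3, 0.2, 2.4, 0.3)
)

# Sort so that "previous row" really means "previous date" within each ID.
setorder(dt, ID, Opendate)

# prior (hypothetical name): enrollment at the immediately preceding date
# in the same ID; NA when the ID has no earlier date.
dt[, prior := shift(Enrollment, type = "lag"), by = ID]

# diff_prior (hypothetical name): prior minus current, as asked in the question.
dt[, diff_prior := prior - Enrollment]

# Keep only rows that actually have a prior value, then plot
# current enrollment on x against prior enrollment on y.
has_prior <- dt[!is.na(prior)]
plot(has_prior$Enrollment, has_prior$prior,
     xlab = "Current enrollment", ylab = "Prior enrollment")
```

With the full data, the same three steps (setorder, shift by ID, subtraction) should apply unchanged once Opendate has been parsed as in the code above.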

DATA

mydf <- structure(list(Number = c(420L, 1476L, 10559L, 6794L, 6451L, 
8754L, 8578L, 11655L, 428L, 112L, 477L, 16277L, 17386L, 411L), 
Trial = structure(c(2L, 3L, 8L, 5L, 4L, 7L, 6L, 9L, 2L, 1L, 
2L, 10L, 11L, 2L), .Label = c("NCT00065325", "NCT00091442", 
"NCT00301457", "NCT00917384", "NCT00948675", "NCT01140347", 
"NCT01168973", "NCT01307397", "NCT01358877", "NCT01843374", 
"NCT01905657"), class = "factor"), ID = c(9L, 26L, 34L, 53L, 
53L, 53L, 53L, 53L, 55L, 62L, 62L, 62L, 62L, 66L), Opendate = structure(c(3L, 
9L, 12L, 11L, 13L, 2L, 8L, 10L, 14L, 5L, 6L, 7L, 4L, 1L), .Label = c("1/12/2005", 
"1/19/2011", "1/28/2005", "1/8/2014", "10/15/2003", "11/11/2005", 
"12/16/2013", "12/30/2011", "2/22/2008", "4/2/2012", "5/12/2010", 
"7/28/2011", "8/17/2010", "9/7/2005"), class = "factor"), 
Enrollment = c(0.2, 1, 0.6, 0, 0.3, 0.2, 2.4, 0.3, 0.1, 0.2, 
0.1, 0.2, 0.6, 0)), .Names = c("Number", "Trial", "ID", "Opendate", 
"Enrollment"), class = "data.frame", row.names = c(NA, -14L))

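The question also asks about looking back one year, two years, and so on. One possible approach, sketched here with several assumptions, is a rolling join in data.table: for each row, compute a cutoff date and pick the most recent observation in the same ID on or before that cutoff. The names lookback_days, lookup, cutoff, matched and prior_1y are made up for illustration, "one year" is approximated as 365 days, and the data is an inline copy of ID 62 from the question.

```r
library(data.table)

# Inline copy of ID 62 from the question (dates already in ISO form).
dt <- data.table(
  ID         = c(62L, 62L, 62L, 62L),
  Opendate   = as.IDate(c("2003-10-15", "2005-11-11",
                          "2013-12-16", "2014-01-08")),
  Enrollment = c(0.2, 0.1, 0.2, 0.6)
)

lookback_days <- 365L  # assumption: one year approximated as 365 days

# Keyed lookup table holding every observation as a candidate "prior" value.
lookup <- dt[, .(ID, Opendate, prior_enrollment = Enrollment)]
setkey(lookup, ID, Opendate)

# For each row, the cutoff is its own date minus the look-back window.
cutoff <- dt[, .(ID, Opendate = Opendate - lookback_days)]

# roll = TRUE carries the last observation forward: per ID, each cutoff is
# matched to the latest lookup row dated on or before it. Rows with nothing
# that old get NA. mult = "last" guards against duplicated dates within an ID.
matched <- lookup[cutoff, on = c("ID", "Opendate"), roll = TRUE, mult = "last"]

# The join result follows the row order of cutoff, which is the row order of dt.
dt[, prior_1y := matched$prior_enrollment]
```

Changing lookback_days to 730L or 1825L would give the two-year and five-year variants under the same approximation.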