How to summarise time-series data with unequal number of observations with R

I have a large data frame (86,000 rows) covering several patients, each of whom had blood tests on several occasions (just 3 tests: T1, T2 and T3) during their stay. Some of these patients were hospitalised for 3 days, some for 168 days.

This is just a fraction of the output from the count function, which shows the large variation in the number of days spent in hospital:

No  Id     Days
148 29757  111
149 30368   36
150 31062   29
151 31993   24
152 32198   51
153 32438    6
154 32836   74
155 32944   24
156 33467   39
157 36108   90
158 36849    6
159 37136    3

I used aggregate to calculate the means etc., but I would like to see a summary of who improved or deteriorated during their stay.

I think this would involve extracting at least the first and last tests and taking the difference (the lower, the better). But I couldn't find a way to do that.

I thought that an easier solution would be to transform all the results to ordered data (according to the normal ranges of the tests) and see how many of them had abnormally low or high values. Unfortunately, almost every patient has lows and highs.

Ideally, I would like to see the progress of several patients (or groups of patients) over time. But since they were hospitalised in different time frames, the (over-simplified) result looks like this:

[Figure: results for just 2 patients, who were hospitalised in completely different time frames]

As you can see, the first patient (red dots) started with mediocre values, quickly worsened (high values), and then improved (lower values). The progress of the second patient is not clear, since his/her stay was probably short.

Could someone suggest a starter (code or idea)? I checked some questions about multiple time-series plots with unequal observations, but they are not helpful in my case. An example (anonymised) dataset is here:

structure(list(Id = c("10200", "10200", "10200", "10200", "10200", 
"10200", "10700", "10700", "10700", "10700", "10700", "10700", 
"10700", "10700", "10700", "10700", "10700", "10700", "10700", 
"10700", "10700", "10766", "10766", "10766", "10766", "10766", 
"10766", "10766", "10766", "10766", "10766", "10766", "10766", 
"10766", "10766", "10766", "10766", "10766", "10766", "10766"
), Date = structure(c(15068, 15068, 15068, 15069, 15069, 15069, 
15072, 15072, 15072, 15072, 15072, 15072, 15073, 15073, 15073, 
15075, 15075, 15075, 15078, 15078, 15078, 15073, 15074, 15074, 
15075, 15075, 15075, 15075, 15076, 15076, 15076, 15078, 15078, 
15078, 15081, 15082, 15083, 15084, 15085, 15085), class = "Date"), 
    Test = c("T1", "T2", "T3", "T1", "T2", "T3", "T1", "T1", 
    "T2", "T2", "T3", "T3", "T1", "T2", "T3", "T1", "T2", "T3", 
    "T1", "T2", "T3", "T1", "T1", "T2", "T1", "T1", "T2", "T2", 
    "T1", "T2", "T3", "T1", "T2", "T3", "T1", "T1", "T2", "T1", 
    "T1", "T2"), Result = c(131, 4.53, 5.4, 108, 3.19, 3.7, 125, 
    NA, 1.26, NA, NA, 3.8, 125, 0.97, 4.2, 73, 0.84, 6.6, 48, 
    0.52, 4.8, 60, 75, 0.83, 52, 51, 0.62, 0.65, 40, 0.57, 4.1, 
    45, 0.54, 3.7, 96, 77, 1.04, 134, 144, 0.95)), .Names = c("Id", 
"Date", "Test", "Result"), row.names = c(3L, 6L, 4L, 2L, 1L, 
5L, 10L, 14L, 9L, 19L, 8L, 11L, 20L, 18L, 7L, 17L, 13L, 21L, 
12L, 15L, 16L, 22L, 28L, 29L, 24L, 31L, 26L, 33L, 34L, 32L, 37L, 
23L, 35L, 25L, 38L, 36L, 30L, 27L, 39L, 40L), class = "data.frame")

I don't know if this is what you want, but you can use the dplyr package. The code below groups the data by "Id", then finds the first and last values in Result, and finally calculates the "difference" in a new column:

mydata <- structure(list(Id=c ( "10200", "10200", "10200", "10200", "10200", "10200", "10700", "10700", "10700", "10700", "10700", "10700", "10700", "10700", "10700", "10700", "10700", "10700", "10700", "10700", "10700", "10766", "10766", "10766",
"10766", "10766", "10766", "10766", "10766", "10766", "10766", "10766", "10766", "10766", "10766", "10766", "10766", "10766", "10766", "10766" ), Date = structure(c(15068, 15068, 15068, 15069, 15069, 15069, 15072, 15072, 15072, 15072, 15072, 15072, 15073, 15073,
15073, 15075, 15075, 15075, 15078, 15078, 15078, 15073, 15074, 15074, 15075, 15075, 15075, 15075, 15076, 15076, 15076, 15078, 15078, 15078, 15081, 15082, 15083, 15084, 15085, 15085), class="Date" ), Test=c ( "T1", "T2", "T3", "T1", "T2", "T3", "T1",
"T1", "T2", "T2", "T3", "T3", "T1", "T2", "T3", "T1", "T2", "T3", "T1", "T2", "T3", "T1", "T1", "T2", "T1", "T1", "T2", "T2", "T1", "T2", "T3", "T1", "T2", "T3", "T1", "T1", "T2", "T1", "T1", "T2"), Result=c (131, 4.53, 5.4, 108, 3.19, 3.7, 125, NA, 1.26,
NA, NA, 3.8, 125, 0.97, 4.2, 73, 0.84, 6.6, 48, 0.52, 4.8, 60, 75, 0.83, 52, 51, 0.62, 0.65, 40, 0.57, 4.1, 45, 0.54, 3.7, 96, 77, 1.04, 134, 144, 0.95)), .Names=c ( "Id", "Date", "Test", "Result"), row.names=c (3L, 6L, 4L, 2L, 1L, 5L, 10L, 14L, 9L, 19L,
8L, 11L, 20L, 18L, 7L, 17L, 13L, 21L, 12L, 15L, 16L, 22L, 28L, 29L, 24L, 31L, 26L, 33L, 34L, 32L, 37L, 23L, 35L, 25L, 38L, 36L, 30L, 27L, 39L, 40L), class="data.frame" )

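library(dplyr)

result <- mydata %>%
  group_by(Id) %>%
  # summarise_each() and funs() are deprecated in current dplyr;
  # plain summarise() produces the same "first" and "last" columns
  summarise(first = first(Result), last = last(Result)) %>%
  mutate(difference = first - last)
result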
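
Because each Id mixes the three tests and the rows are not sorted by date, the first and last values of a patient can come from different tests. A possible refinement, offered only as a sketch, is to drop missing results, sort by date, and compute the difference separately for each patient and test. The names labs, first_result, last_result and days_between are made up for illustration, the small data frame reuses a few rows from the question's example data, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Small stand-in with the same columns as the question's data
# (Id, Date, Test, Result); the rows are copied from the example dataset.
labs <- data.frame(
  Id     = c("10200", "10200", "10200", "10200", "10700", "10700", "10700"),
  Date   = as.Date(c(15068, 15068, 15069, 15069, 15072, 15072, 15078),
                   origin = "1970-01-01"),
  Test   = c("T1", "T2", "T1", "T2", "T1", "T1", "T1"),
  Result = c(131, 4.53, 108, 3.19, 125, NA, 48),
  stringsAsFactors = FALSE
)

change <- labs %>%
  filter(!is.na(Result)) %>%       # ignore missing results
  arrange(Id, Test, Date) %>%      # chronological order within patient and test
  group_by(Id, Test) %>%
  summarise(
    first_result = first(Result),  # earliest available result
    last_result  = last(Result),   # latest available result
    days_between = as.numeric(last(Date) - first(Date)),
    .groups = "drop"
  ) %>%
  # positive difference = the value fell between first and last test,
  # which the question treats as an improvement ("the lower, the better")
  mutate(difference = first_result - last_result)

change
```

Keeping days_between next to the difference makes it easier to tell a change over a 3-day stay apart from one over a much longer stay.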
