R: How to visualize change in binary/categorical data over time

>dput(data)
structure(list(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 
3, 3), Dx = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1), Month = c(0, 
6, 12, 18, 24, 0, 6, 12, 18, 24, 0, 6, 12, 18, 24), score = c(0, 
0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0)), .Names = c("ID", 
"Dx", "Month", "score"), row.names = c(NA, -15L), class = "data.frame")

>data
    ID Dx Month score
1   1  1     0     0
2   1  1     6     0
3   1  1    12     0
4   1  1    18     1
5   1  1    24     1
6   2  1     0     1
7   2  1     6     1
8   2  2    12     1
9   2  2    18     0
10  2  2    24     1
11  3  1     0     0
12  3  1     6     0
13  3  1    12     0
14  3  1    18     0
15  3  1    24     0

Suppose I have the above data.frame. I have 3 patients (ID = 1, 2 or 3). Dx is the diagnosis (Dx = 1 is normal, Dx = 2 is diseased). There is a Month variable. And last but not least, there is a test score variable. The participants' test score is binary, and it can change from 0 to 1 or revert from 1 to 0. I am having trouble coming up with a way to visualize this data. I would like an informative graph that looks at:

  1. The trend of the participants' test scores over time.
  2. How that trend compares to the participants' diagnosis over time.

In my real dataset I have over 800 participants, so I do not want to construct 800 separate graphs ... I think the test score variable being binary really has me stumped. Any help would be appreciated.

With ggplot2 you can make faceted plots with subplots for each patient (see my solution for dealing with the large number of plots below). An example visualization:

library(ggplot2)
ggplot(data, aes(x=Month, y=score, color=factor(Dx))) +
  geom_point(size=5) +
  scale_x_continuous(breaks=c(0,6,12,18,24)) +
  scale_color_discrete("Diagnosis",labels=c("normal","diseased")) +
  facet_grid(.~ID) +
  theme_bw()

which gives a faceted plot with one panel per patient, with the points colored by diagnosis.


Including 800 patients in one plot might be a bit too much, as already mentioned in the comments on the question. There are several solutions to this problem:

  1. Aggregate the data.
  2. Create patient subgroups and make a plot for each subgroup.
  3. Filter out all the patients who have never been ill.

With regard to the last suggestion, you can do that with the following code (which I adapted from an answer to one of my own questions):

deleteable <- with(data, ave(Dx, ID, FUN=function(x) all(x==1)))
data2 <- data[deleteable==0,]

You can use this as well for creating a new variable that flags patients who have never been ill:

data$neverill <- with(data, ave(Dx, ID, FUN=function(x) all(x==1)))

Then you can, for example, aggregate the data with several grouping variables (e.g. Month, neverill).
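A minimal sketch of such an aggregation, using base R's aggregate() on the sample data from the question. The name agg is only illustrative, and taking the mean of score (the proportion of positive scores) is one possible choice of summary, not necessarily the one the answer had in mind:

```r
# Sample data from the question
data <- data.frame(
  ID    = rep(1:3, each = 5),
  Dx    = c(1, 1, 1, 1, 1,  1, 1, 2, 2, 2,  1, 1, 1, 1, 1),
  Month = rep(c(0, 6, 12, 18, 24), times = 3),
  score = c(0, 0, 0, 1, 1,  1, 1, 1, 0, 1,  0, 0, 0, 0, 0)
)

# 1 = patient was never diagnosed as diseased, 0 = diseased at least once
data$neverill <- with(data, ave(Dx, ID, FUN = function(x) all(x == 1)))

# Proportion of positive scores per month, split by the neverill flag
agg <- aggregate(score ~ Month + neverill, data = data, FUN = mean)
print(agg)
```

The result has one row per combination of Month and neverill, so it stays small no matter how many patients there are, and it can be passed to ggplot with neverill as the color or facet variable.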

Note: Most of the data manipulation below is needed for part 2 (comparing the scores with the diagnosis). Part 1 (the score trend) is less complex, and it is handled in the same workflow below.

Uses

library(data.table)
library(ggplot2)
library(reshape2)

To Compare

First, recode Dx from 1/2 to 0/1 (assuming that a 0 in score corresponds to a 1 in the original Dx):

data$Dx <- data$Dx - 1

Now, create a lookup matrix (rows are Dx, columns are score) that returns 1 for a diagnosis of 1 with a test score of 0, and -1 for a test score of 1 with a diagnosis of 0:

compare <- matrix(c(0,1,-1,0),ncol = 2,dimnames = list(c(0,1),c(0,1)))
> compare
  0  1
0 0 -1
1 1  0

Now, let's score every event. This simply looks up the matrix above for every row in your data:

data$calc <- diag(compare[as.character(data$Dx),as.character(data$score)])

*Note: This can be sped up for large data sets using matching, but it is a quick fix for smaller sets like yours.
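One way to avoid building the full n-by-n intermediate matrix is to index the lookup matrix with a two-column character matrix, which base R matches row by row against the dimnames. This is a sketch of that idea, not necessarily the "matching" approach the note refers to; the vectors Dx and score below are the sample data with Dx already recoded to 0/1, and slow and fast are illustrative names:

```r
compare <- matrix(c(0, 1, -1, 0), ncol = 2,
                  dimnames = list(c(0, 1), c(0, 1)))

# Sample data from the question, with Dx already recoded to 0/1
Dx    <- c(0, 0, 0, 0, 0,  0, 0, 1, 1, 1,  0, 0, 0, 0, 0)
score <- c(0, 0, 0, 1, 1,  1, 1, 1, 0, 1,  0, 0, 0, 0, 0)

# Original approach: builds a 15 x 15 matrix and keeps only its diagonal
slow <- diag(compare[as.character(Dx), as.character(score)])

# Alternative: one (row name, column name) pair per observation
fast <- compare[cbind(as.character(Dx), as.character(score))]

stopifnot(isTRUE(all.equal(unname(slow), fast)))
```

With 800 patients and five visits each, the diag() version allocates a 4000 x 4000 matrix, while the pairwise lookup only ever holds one value per row.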

To allow us to use data.table aggregation:

data <- data.table(data)

Now we need to create our variables:

tograph <- melt(data[, list(ScoreTrend = sum(score)/.N, 
                            Type = sum(calc)/length(calc[calc != 0]), 
                            Measure = sum(abs(calc))), 
                     by = Month],
                id.vars = c("Month"))
  • ScoreTrend: The proportion of positive scores in each month. It shows the trend of scores over time.
  • Type: The balance of -1 vs. 1 events over time. If this returns -1, all mismatches were score = 1, Dx = 0. If it returns 1, all mismatches were Dx = 1, score = 0. A zero means the two kinds of mismatch balance out.
  • Measure: The raw number of incorrect events (mismatches between diagnosis and score).
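To make the three measures concrete, here is a base R sketch that computes them on the sample data without data.table. The names d and summ are illustrative, and calc is computed as Dx - score, which for 0/1 coding gives the same values as the lookup matrix (1 for Dx = 1 with score = 0, -1 for Dx = 0 with score = 1, 0 otherwise):

```r
# Sample data from the question, with Dx already recoded to 0/1
d <- data.frame(
  Month = rep(c(0, 6, 12, 18, 24), times = 3),
  Dx    = c(0, 0, 0, 0, 0,  0, 0, 1, 1, 1,  0, 0, 0, 0, 0),
  score = c(0, 0, 0, 1, 1,  1, 1, 1, 0, 1,  0, 0, 0, 0, 0)
)
d$calc <- d$Dx - d$score

# One summary row per month
summ <- do.call(rbind, lapply(split(d, d$Month), function(g) {
  data.frame(
    Month      = g$Month[1],
    ScoreTrend = mean(g$score),                   # share of positive scores
    Type       = sum(g$calc) / sum(g$calc != 0),  # NaN when no mismatches
    Measure    = sum(abs(g$calc))                 # number of mismatches
  )
}))
print(summ)
```

On this data, month 12 has no mismatches, so Type is 0/0 = NaN there, while month 18 has one mismatch of each kind, so Type is 0 and Measure is 2.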

We melt this data frame along Month so that we can create a faceted graph.

If there are no incorrect events, we will get a NaN for Type. To set this to 0:

tograph[is.nan(value), value := 0]
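A small base R illustration of why NaN values have to be detected with is.nan() rather than with an equality test: any comparison against NaN evaluates to NA, so it never selects a row. The vector x is made up for the example:

```r
x <- c(0.5, NaN, -1)

# Equality against NaN is NA for every element, so it selects nothing
cmp <- x == NaN
print(cmp)

# is.nan() identifies the NaN entries directly
print(is.nan(x))

# Replace NaN with 0
x[is.nan(x)] <- 0
print(x)
```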

Finally, we can plot:

ggplot(tograph, aes(x = Month, y = value)) + geom_line() + facet_wrap(~variable, ncol = 1)

We can now see, in one plot:

  • The proportion of positive scores by month.
  • The proportion of under- vs. over-diagnosis.
  • The number of incorrect diagnoses.
