
Good Ways to Visualize Longitudinal Categorical Data in R

[Update: Although I've accepted an answer, please add another answer if you have additional visualization ideas (whether in R or another language/program). Texts on categorical data analysis don't seem to say much about visualizing longitudinal data, while texts on longitudinal data analysis don't seem to say much about visualizing within-subject changes over time in category membership. Having more answers to this question will make it a better resource on an issue that doesn't get much coverage in standard references.]

A colleague just gave me a longitudinal categorical data set to look at, and I'm trying to figure out how to capture the longitudinal aspect in a visualization. I'm posting here because I'd like to do this in R, but please let me know if it makes sense to also cross-post to Cross Validated, since cross-posting is generally discouraged.

Quick background: The data track the academic standing from term to term for students who went through an academic advising program. The data are in long format and have five variables: "id", "cohort", "term", "standing", and "termGPA". The first two identify the student and the term in which they were in the advising program. The last three give the term of record and the student's academic standing and GPA for that term. I've pasted in some sample data below using dput.

I've created a mosaic plot (see below) that groups students by cohort, standing, and term. This shows what fraction of students were in each academic-standing category in each term. But this doesn't capture the longitudinal aspect--the fact that individual students are tracked over time. I'd like to track the path that groups of students with a given academic standing take over time.

For example: Of students with standing "AP" (academic probation) in Fall 2009 ("F09"), what fraction were still AP in future terms, and what fraction moved into other categories (e.g., GS, "good standing")? Are there differences between cohorts in terms of movement between categories with time since entry into the advising program?
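One way to put numbers on the "what fraction of AP students stayed AP" question is a term-to-term transition table. The sketch below uses a small made-up data frame (`toy`, with hypothetical values but the same column names as the real data) rather than the actual data set, reshapes it to wide format, and cross-tabulates standing in one term against standing in the next.

```r
# Toy long-format data (hypothetical values; column names mirror df1)
toy <- data.frame(
  id       = rep(1:4, each = 3),
  term     = factor(rep(c("F09", "S10", "F10"), times = 4),
                    levels = c("F09", "S10", "F10"), ordered = TRUE),
  standing = c("AP", "AP", "GS",
               "AP", "GS", "GS",
               "GS", "GS", "GS",
               "AP", "DQ", NA),
  stringsAsFactors = FALSE
)

# One row per student, one standing column per term
wide <- reshape(toy, idvar = "id", timevar = "term", direction = "wide")

# Counts of students moving from each F09 standing to each S10 standing
trans <- table(from = wide$standing.F09, to = wide$standing.S10)

# Row proportions: of students with a given F09 standing,
# the share ending up in each S10 standing
trans_prop <- prop.table(trans, margin = 1)
round(trans_prop, 2)
```

Repeating the cross-tabulation for each pair of adjacent terms, or separately within each cohort, would give the between-cohort comparison; each row of such a table is also what a flow-style plot would draw.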

I couldn't quite figure out how to capture this longitudinal aspect in an R graphic. The vcd package has facilities for visualizing categorical data, but doesn't seem to address longitudinal categorical data. Are there "standard" methods for visualizing longitudinal categorical data? Does R have packages designed for this? Is long format appropriate for this type of data, or would I be better off with wide format?

I would appreciate suggestions for solving this particular problem and also suggestions for articles, books, etc. for learning more about visualizing longitudinal categorical data.

Here's the code I used to make the mosaic plot. The code uses the data listed below with dput.

library(RColorBrewer)

# create a table object for plotting
df1.tab = table(df1$cohort, df1$term, df1$standing,
            dnn=c("Cohort\nAcademic Standing", "Term", "Standing"))

# create a mosaic plot
plot(df1.tab, las=1, dir=c("h","v","h"), 
     col=brewer.pal(8,"Dark2"),
     main="Fall 2009 and Fall 2010 Cohorts")

Here's the mosaic plot (side question: is there any way to make the columns for the F10 cohort sit directly under and have the same width as the columns for the F09 cohort, even when there's no data for some terms in the F10 cohort?):

[image: mosaic plot]

And here's the data used to create the table and the plot:

df1 =
structure(list(id = c(101L, 102L, 103L, 104L, 105L, 106L, 107L, 
108L, 109L, 110L, 111L, 112L, 113L, 114L, 115L, 116L, 117L, 118L, 
119L, 120L, 121L, 122L, 123L, 124L, 125L, 101L, 102L, 103L, 104L, 
105L, 106L, 107L, 108L, 109L, 110L, 111L, 112L, 113L, 114L, 115L, 
116L, 117L, 118L, 119L, 120L, 121L, 122L, 123L, 124L, 125L, 101L, 
102L, 103L, 104L, 105L, 106L, 107L, 108L, 109L, 110L, 111L, 112L, 
113L, 114L, 115L, 116L, 117L, 118L, 119L, 120L, 121L, 122L, 123L, 
124L, 125L, 101L, 102L, 103L, 104L, 105L, 106L, 107L, 108L, 109L, 
110L, 111L, 112L, 113L, 114L, 115L, 116L, 117L, 118L, 119L, 120L, 
121L, 122L, 123L, 124L, 125L, 101L, 102L, 103L, 104L, 105L, 106L, 
107L, 108L, 109L, 110L, 111L, 112L, 113L, 114L, 115L, 116L, 117L, 
118L, 119L, 120L, 121L, 122L, 123L, 124L, 125L, 101L, 102L, 103L, 
104L, 105L, 106L, 107L, 108L, 109L, 110L, 111L, 112L, 113L, 114L, 
115L, 116L, 117L, 118L, 119L, 120L, 121L, 122L, 123L, 124L, 125L, 
101L, 102L, 103L, 104L, 105L, 106L, 107L, 108L, 109L, 110L, 111L, 
112L, 113L, 114L, 115L, 116L, 117L, 118L, 119L, 120L, 121L, 122L, 
123L, 124L, 125L), cohort = structure(c(1L, 1L, 1L, 1L, 2L, 1L, 
1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 
1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 
2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 
1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 
1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L), .Label = c("F09", "F10"), class = c("ordered", 
"factor")), term = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 
7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L, 7L, 7L, 7L), .Label = c("S09", "F09", "S10", 
"F10", "S11", "F11", "S12"), class = c("ordered", "factor")), 
    standing = structure(c(2L, 4L, 1L, 4L, NA, 4L, 1L, NA, NA, 
    NA, NA, 2L, 2L, 1L, 4L, 4L, 1L, 3L, NA, NA, 4L, 3L, 1L, 4L, 
    NA, 2L, 1L, 3L, 3L, NA, 1L, 2L, NA, NA, NA, NA, 2L, 4L, 3L, 
    4L, 4L, 4L, 2L, NA, NA, 4L, 2L, 4L, 4L, NA, 3L, 4L, 6L, 6L, 
    1L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 4L, 6L, 4L, 4L, 1L, 4L, 1L, 
    2L, 4L, 3L, 1L, 4L, 1L, 6L, 1L, 6L, 6L, 7L, 4L, 4L, 2L, 2L, 
    4L, 2L, 6L, 4L, 6L, 7L, 4L, 2L, 4L, 1L, 2L, 4L, 6L, 6L, 4L, 
    2L, 2L, 3L, 6L, 6L, 7L, 4L, 4L, 3L, 4L, 4L, 6L, 2L, 1L, 6L, 
    6L, 4L, 2L, 1L, 7L, 2L, 4L, 6L, 6L, 4L, 4L, 3L, 6L, 4L, 6L, 
    2L, 4L, 4L, 6L, 4L, 4L, 6L, 3L, 2L, 6L, 6L, 4L, 2L, 6L, 3L, 
    4L, 4L, 6L, 6L, 4L, 4L, 5L, 6L, 4L, 6L, 4L, 4L, 4L, 5L, 4L, 
    4L, 6L, 6L, 2L, 6L, 6L, 4L, 3L, 6L, 6L, 4L, 4L, 6L, 6L, 4L, 
    4L), .Label = c("AP", "CP", "DQ", "GS", "DM", "NE", "WD"), class = "factor"), 
    termGPA = c(1.433, 1.925, 1, 1.68, NA, 1.579, 1.233, NA, 
    NA, NA, NA, 2.009, 1.675, 0, 1.5, 1.86, 0.5, 0.94, NA, NA, 
    1.777, 1.1, 1.133, 1.675, NA, 2, 1.25, 1.66, 0, NA, 1.525, 
    2.25, NA, NA, NA, NA, 1.66, 2.325, 0, 2.308, 1.6, 1.825, 
    2.33, NA, NA, 2.65, 2.65, 2.85, 3.233, NA, 1.25, 1.575, NA, 
    NA, 1, 2.385, 3.133, 0, 0, 1.729, 1.075, 0, 4, NA, 2.74, 
    0, 1.369, 2.53, 0, 2.65, 2.75, 0, 0.333, 3.367, 1, NA, 0.1, 
    NA, NA, 1, 2.2, 2.18, 2.31, 1.75, 3.073, 0.7, NA, 1.425, 
    NA, 2.74, 2.9, 0.692, 2, 0.75, 1.675, 2.4, NA, NA, 3.829, 
    2.33, 2.3, 1.5, NA, NA, NA, 2.69, 1.52, 0.838, 2.35, 1.55, 
    NA, 1.35, 0.66, NA, NA, 1.35, 1.9, 1.04, NA, 1.464, 2.94, 
    NA, NA, 3.72, 2.867, 1.467, NA, 3.133, NA, 1, 2.458, 1.214, 
    NA, 3.325, 2.315, NA, 1, 2.233, NA, NA, 2.567, 1, NA, 0, 
    3.325, 2.077, NA, NA, 3.85, 2.718, 1.385, NA, 2.333, NA, 
    2.675, 1.267, 1.6, 1.388, 3.433, 0.838, NA, NA, 0, NA, NA, 
    2.6, 0, NA, NA, 1, 2.825, NA, NA, 3.838, 2.883)), .Names = c("id", 
"cohort", "term", "standing", "termGPA"), row.names = c("101.F09.s09", 
"102.F09.s09", "103.F09.s09", "104.F09.s09", "105.F10.s09", "106.F09.s09", 
"107.F09.s09", "108.F10.s09", "109.F10.s09", "110.F10.s09", "111.F10.s09", 
"112.F09.s09", "113.F09.s09", "114.F09.s09", "115.F09.s09", "116.F09.s09", 
"117.F09.s09", "118.F09.s09", "119.F10.s09", "120.F10.s09", "121.F09.s09", 
"122.F09.s09", "123.F09.s09", "124.F09.s09", "125.F10.s09", "101.F09.f09", 
"102.F09.f09", "103.F09.f09", "104.F09.f09", "105.F10.f09", "106.F09.f09", 
"107.F09.f09", "108.F10.f09", "109.F10.f09", "110.F10.f09", "111.F10.f09", 
"112.F09.f09", "113.F09.f09", "114.F09.f09", "115.F09.f09", "116.F09.f09", 
"117.F09.f09", "118.F09.f09", "119.F10.f09", "120.F10.f09", "121.F09.f09", 
"122.F09.f09", "123.F09.f09", "124.F09.f09", "125.F10.f09", "101.F09.s10", 
"102.F09.s10", "103.F09.s10", "104.F09.s10", "105.F10.s10", "106.F09.s10", 
"107.F09.s10", "108.F10.s10", "109.F10.s10", "110.F10.s10", "111.F10.s10", 
"112.F09.s10", "113.F09.s10", "114.F09.s10", "115.F09.s10", "116.F09.s10", 
"117.F09.s10", "118.F09.s10", "119.F10.s10", "120.F10.s10", "121.F09.s10", 
"122.F09.s10", "123.F09.s10", "124.F09.s10", "125.F10.s10", "101.F09.f10", 
"102.F09.f10", "103.F09.f10", "104.F09.f10", "105.F10.f10", "106.F09.f10", 
"107.F09.f10", "108.F10.f10", "109.F10.f10", "110.F10.f10", "111.F10.f10", 
"112.F09.f10", "113.F09.f10", "114.F09.f10", "115.F09.f10", "116.F09.f10", 
"117.F09.f10", "118.F09.f10", "119.F10.f10", "120.F10.f10", "121.F09.f10", 
"122.F09.f10", "123.F09.f10", "124.F09.f10", "125.F10.f10", "101.F09.s11", 
"102.F09.s11", "103.F09.s11", "104.F09.s11", "105.F10.s11", "106.F09.s11", 
"107.F09.s11", "108.F10.s11", "109.F10.s11", "110.F10.s11", "111.F10.s11", 
"112.F09.s11", "113.F09.s11", "114.F09.s11", "115.F09.s11", "116.F09.s11", 
"117.F09.s11", "118.F09.s11", "119.F10.s11", "120.F10.s11", "121.F09.s11", 
"122.F09.s11", "123.F09.s11", "124.F09.s11", "125.F10.s11", "101.F09.f11", 
"102.F09.f11", "103.F09.f11", "104.F09.f11", "105.F10.f11", "106.F09.f11", 
"107.F09.f11", "108.F10.f11", "109.F10.f11", "110.F10.f11", "111.F10.f11", 
"112.F09.f11", "113.F09.f11", "114.F09.f11", "115.F09.f11", "116.F09.f11", 
"117.F09.f11", "118.F09.f11", "119.F10.f11", "120.F10.f11", "121.F09.f11", 
"122.F09.f11", "123.F09.f11", "124.F09.f11", "125.F10.f11", "101.F09.s12", 
"102.F09.s12", "103.F09.s12", "104.F09.s12", "105.F10.s12", "106.F09.s12", 
"107.F09.s12", "108.F10.s12", "109.F10.s12", "110.F10.s12", "111.F10.s12", 
"112.F09.s12", "113.F09.s12", "114.F09.s12", "115.F09.s12", "116.F09.s12", 
"117.F09.s12", "118.F09.s12", "119.F10.s12", "120.F10.s12", "121.F09.s12", 
"122.F09.s12", "123.F09.s12", "124.F09.s12", "125.F10.s12"), reshapeLong = structure(list(
    varying = list(c("s09as", "f09as", "s10as", "f10as", "s11as", 
    "f11as", "s12as"), c("s09termGPA", "f09termGPA", "s10termGPA", 
    "f10termGPA", "s11termGPA", "f11termGPA", "s12termGPA")), 
    v.names = c("standing", "termGPA"), idvar = c("id", "cohort"
    ), timevar = "term"), .Names = c("varying", "v.names", "idvar", 
"timevar")), class = "data.frame")

Here are a few ideas for plotting your data. I've used ggplot2, and I've reformatted the data a bit in places.

Figure 1

I've used a stacked barplot to mimic your mosaic plot and solve the alignment issue.

Figure 2

Data points for each student are connected by a gray line, making this reminiscent of a parallel coordinates plot. Coloring the points shows the categorical standing. Using GPA on the y-axis helps spread out the points to reduce overplotting, and shows correlation of standing and GPA. A major problem is that many valid standing data points drop out because they lack a matching termGPA value.

Figure 3

Here I've created a new variable called initial_standing to use for faceting. Each panel contains students who match in both cohort and initial_standing. Plotting the id as text makes this figure a bit cluttered, but could be useful in some cases.

Figure 4

This plot is like a heatmap where each row is a student. I controlled the order of the id axis to force initial_standing and cohort groupings to stay together. If you have many more rows, you may want to consider sorting rows by some type of clustering.

library(ggplot2)

# Create new data frame for determining initial standing.
standing_data = data.frame(id=unique(df1$id), initial_standing=NA, cohort=NA)

for (i in 1:nrow(standing_data)) {
    id = standing_data$id[i]
    subdat = df1[df1$id == id, ]
    subdat = subdat[complete.cases(subdat), ]
    initial_standing = subdat$standing[which.min(subdat$term)]
    standing_data[i, "initial_standing"] = as.character(initial_standing)
    standing_data[i, "cohort"] = as.character(subdat$cohort[1])
}

standing_data$cohort = factor(standing_data$cohort, levels=levels(df1$cohort))
standing_data$initial_standing = factor(standing_data$initial_standing,
                                        levels=levels(df1$standing))

# Add the new column (initial_standing) to df1.
df1 = merge(df1, standing_data[, c("id", "initial_standing")], by="id")

# Remove rows where standing is missing. Make some plots tidier.
df1 = df1[!is.na(df1$standing), ]

# Create id factor, controlling the sort order of the levels.     
id_order = order(standing_data$initial_standing, standing_data$cohort)
df1$id = factor(df1$id, levels=as.character(standing_data$id)[id_order])


p1 = ggplot(df1, aes(x=term, fill=standing)) +
     geom_bar(position="fill", colour="grey20", size=0.5, width=1.0) +
     facet_grid(cohort ~ .) +
     scale_fill_brewer(palette="Set1")

p2 = ggplot(df1, aes(x=term, y=termGPA, group=id)) + 
     geom_line(colour="grey70") + 
     geom_point(aes(colour=standing), size=4) + 
     facet_grid(cohort ~ .) +
     scale_colour_brewer(palette="Set1")

p3 = ggplot(df1, aes(x=term, y=termGPA, group=id)) +
     geom_line(colour="grey70") + 
     geom_point(aes(colour=standing), size=4) + 
     geom_text(aes(label=id), hjust=-0.30, size=3) +
     facet_grid(initial_standing ~ cohort) +
     scale_colour_brewer(palette="Set1")


p4 = ggplot(df1, aes(x=term, y=id, fill=standing)) + 
     geom_tile(colour="grey20") +
     facet_grid(initial_standing ~ ., space="free_y", scales="free_y") +
     scale_fill_brewer(palette="Set1") +
     theme(panel.grid.major=element_blank(),
           panel.grid.minor=element_blank())

ggsave("plot_1.png", p1, width=10, height=6.25, dpi=80)
ggsave("plot_2.png", p2, width=10, height=6.25, dpi=80)
ggsave("plot_3.png", p3, width=10, height=6.25, dpi=80)
ggsave("plot_4.png", p4, width=10, height=6.25, dpi=80)
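The remark under Figure 4 about sorting rows by some type of clustering could be sketched roughly as follows. This is only an illustration on a made-up matrix (`states`, hypothetical values); the choice of Hamming distance and average linkage is an assumption for the example, not something taken from the plots above.

```r
# Toy wide matrix: rows are students, columns are terms (hypothetical values)
states <- rbind(
  s1 = c("AP", "AP", "GS", "GS"),
  s2 = c("GS", "GS", "GS", "GS"),
  s3 = c("AP", "AP", "GS", "NE"),
  s4 = c("GS", "GS", "GS", "NE"),
  s5 = c("AP", "DQ", "NE", "NE")
)
colnames(states) <- c("F09", "S10", "F10", "S11")

# Hamming distance: number of terms in which two students' standings differ
n <- nrow(states)
ham <- matrix(0, n, n, dimnames = list(rownames(states), rownames(states)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    ham[i, j] <- sum(states[i, ] != states[j, ])
  }
}

# Hierarchical clustering; hc$order is a row order that keeps
# similar sequences adjacent to each other
hc <- hclust(as.dist(ham), method = "average")
row_order <- rownames(states)[hc$order]
row_order
```

With real data, `row_order` could be supplied as the `levels` of the id factor before calling `geom_tile`, in place of the ordering by initial_standing and cohort.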

In researching my question, I've found a few other options that I'll list here.

A number of relatively new R packages are designed for visualizing and analyzing "life history" or "multistate sequence" data. The idea is that over time people (or objects) enter and exit various categories--for example, career changes, marriage and divorce, health and disease, or, in my case, categories of academic standing in college.

R packages for visualizing sequence or life history data include biograph, mentioned by @timriffe in a comment above, and TraMineR. The author of the biograph package, Frans Willekens, has a book on the package, "Biograph. Multistate analysis of life histories with R", that will be published by Springer this fall. TraMineR has a detailed user manual at the link above and also a shorter JSS article. JSS also has a special issue on multi-state models in the context of risk analysis that discusses additional R packages for multistate modeling.
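The core idea behind sequence data, treating each person's history as one ordered string of states, can be illustrated without any of these packages. The base-R sketch below uses made-up data (`toy`, hypothetical values) and does not use the TraMineR or biograph APIs; it collapses each student's standings into a sequence and counts how often each distinct sequence occurs.

```r
# Toy long-format data (hypothetical values)
toy <- data.frame(
  id       = rep(1:5, each = 3),
  term     = factor(rep(c("F09", "S10", "F10"), times = 5),
                    levels = c("F09", "S10", "F10"), ordered = TRUE),
  standing = c("AP", "AP", "GS",
               "AP", "GS", "GS",
               "AP", "AP", "GS",
               "GS", "GS", "GS",
               "AP", "GS", "GS"),
  stringsAsFactors = FALSE
)

# Sort by student and term so each sequence is in chronological order
toy <- toy[order(toy$id, toy$term), ]

# Collapse each student's standings into one string such as "AP-AP-GS"
seqs <- tapply(toy$standing, toy$id, paste, collapse = "-")

# Tabulate the distinct sequences, most common first
seq_freq <- sort(table(seqs), decreasing = TRUE)
seq_freq
```

A frequency table like this is a quick check on how many distinct paths exist before choosing a plot; with many terms and states, most sequences tend to be unique, which is where dedicated sequence packages become more useful.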

I also found some specialized software designed to visualize movements between categories over time. Parallel Sets is a simple, free program for producing basic visualizations, although it has limited flexibility. Lifeflow is more sophisticated. It's also free, but you have to send an email to the creator requesting a copy.

I'll add more details to this answer once I've had a chance to try out these tools.

I wish I had found @bdemarest's answer before I wrote an R package to solve this problem, but since the OP requested additional updates, I'll share one more solution. What bdemarest suggested in Figure 4 is a type of what I have been calling a horizontal line plot.

In developing the longCatEDA R package, we found that sorting the data was crucial to making useful plots (see example(sorter) and the report linked in the comment below for technical details), especially as the size of the problem became large. For example, we began with daily drinking data (abstinent, use, abuse) for several thousand participants over 3 years (>1000 days).
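To give a rough sense of what sorting does, here is a minimal base-R sketch on a made-up matrix of integer-coded states (hypothetical values). It is a plain lexicographic sort by successive time points and is not the algorithm used by sorter() in longCatEDA.

```r
# Toy wide matrix (hypothetical): rows are participants, columns are time points
# States are coded 1 = abstinent, 2 = use, 3 = abuse
y_toy <- rbind(
  c(2, 3, 3, 1),
  c(1, 1, 2, 2),
  c(2, 2, 3, 3),
  c(1, 1, 1, 1),
  c(2, 3, 1, 1)
)

# Order rows by the first time point, breaking ties with each later one in turn
ord <- do.call(order, as.data.frame(y_toy))
y_sorted <- y_toy[ord, ]
y_sorted
```

After sorting, participants who share the same early history sit in adjacent rows, so a row-per-participant plot shows contiguous bands of color instead of noise.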

The code to apply the horizontal line plot to @eipi10's data is below. Figure 1 stratifies by cohort, and Figure 2 stratifies by first status, as in Figure 4 of @bdemarest, though the results are not identical because of within-stratum sorting.

Figure 1

[image: horizontal line plot stratified by cohort]

Figure 2

[image: horizontal line plot stratified by initial status]

# libraries
install.packages('longCatEDA')
library(longCatEDA)
library(RColorBrewer)

# transform data long to wide
dfw <- reshape(df1,
           timevar = 'term',
           idvar = c('id', 'cohort'),
           direction = 'wide')

# set up objects required by longCat()
y <- dfw[,seq(3,15,by=2)]
Labels <- levels(df1$standing)
tLabels <- levels(df1$term)
groupLabels <- levels(dfw$cohort)

# use the same colors as bdemarest
cols <- brewer.pal(7, "Set1")

# plot the longCat object
png('plot1.png', width=10, height=6.25, units='in', res=100)
par(bg='cornsilk3', mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
lc <- longCat(y=y, Labels=Labels, tLabels=tLabels, id=dfw$id) 
longCatPlot(lc, cols=cols, xlab='Term', lwd=8, legendBuffer=0)
legend(8.1, 25, legend=Labels, col=cols, lty=1, lwd=4)
dev.off()

# stratify by cohort
png('plot2.png', width=10, height=6.25, units='in', res=100)
par(bg='cornsilk3', mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
lc.g <- sorter(lc, group=dfw$cohort, groupLabels=groupLabels)
longCatPlot(lc.g, cols=cols, xlab='Term', lwd=8, legendBuffer=0) 
legend(8.1, 25, legend=Labels, col=cols, lty=1, lwd=4)
dev.off()

# stratify by first status, akin to Figure 4 by bdemarest
png('plot3.png', width=10, height=6.25, units='in', res=100)
par(bg='cornsilk3', mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
first <- apply(!is.na(y), 1, function(x) which(x)[1])
first <- y[cbind(seq_along(first), first)]
lc.1 <- sorter(lc, group=factor(first), groupLabels = sort(unique(first)))
longCatPlot(lc.1, cols=cols, xlab='Term', lwd=8, legendBuffer=0) 
legend(8.1, 25, legend=Labels, col=cols, lty=1, lwd=4)
dev.off()
