
Structuring and cross-referencing time-bound data points in R

This is a two-part question: the first part concerns how to structure my data, and the second asks for suggestions regarding my research design.

I have three sets of data that look like this:

Sample of data set 1

start time   stop time   char
     0.000       9.719   A
     9.719      11.735   B
    11.735      14.183   A
    14.183      16.554   C
    16.554      18.482   A
    18.482      19.553   B

They are in CSV format and were exported from a video-annotation software tool. Each row represents an annotation: the first column shows when the annotation begins, and the second column shows when it ends. The values in the third column refer to the particular character that is being depicted and/or talked about in that annotation.

Data sets 2 and 3 look the same, although the values in all three columns are different. Importantly, data sets 2 and 3 come from the same recording. Thus I have three "channels" in which the same characters are represented/discussed, but not always at the same time.

If the data were successfully imported into R and visualized on a time scale, it should look something like this:

[Figure: sketch of the desired visualization]

The Y axis would be the three different data sets or "channels", and the X axis would be the duration of the entire recording. The data points plotted here would represent the annotations and when they begin and end.

Question #1

How do I appropriately structure my data so that a particular value (A, B, or C) is bound to its specific start and stop times? I imagine that embedded vectors are involved, but I'm not sure how to set that up.

Once that is accomplished, what would be the appropriate plot to use to visualize the data and confirm that it's showing what I want to show? Something like a mosaic plot, perhaps?

Question #2

When the data is set up appropriately, I want to investigate when the character values (A, B, C) align or do not align with the same values in the other channels (1, 2, 3). How would I go about doing that? I suppose I would need one of the three channels (1, 2, or 3) to serve as a sort of anchor point. Thus, something like:

"For every A value in channel 1, what values overlap with it in channels 2 and 3?"

I would also like to be able to query a specific data point in a given channel and pull up its value as well as the values of the data points in the other channels that co-occur with it. Thus, I should be able to pull up data point #15 in channel 2 and get its value (A, B, or C) as well as the number of data points in the other channels that co-occur within the window of data point #15's duration (and their values).
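One way such a query could be expressed is sketched below in base R. It is only an illustration under stated assumptions: the toy data, the column names (`channel`, `start`, `stop`, `char`) and the helper `co_occurring()` are all hypothetical, and "co-occur" is taken to mean that two time spans overlap by more than a single shared boundary point.

```r
# Hypothetical toy data in long format: one row per annotation.
ann <- data.frame(
  channel = c(1, 1, 1, 2, 2, 2, 3, 3),
  start   = c(0, 5, 8, 0, 4, 9, 0, 6),
  stop    = c(5, 8, 12, 4, 9, 12, 6, 12),
  char    = c("A", "B", "A", "A", "C", "B", "B", "A")
)

# Hypothetical helper: return the annotations in the *other* channels
# whose time span overlaps row `i` of `ann`.
# Two intervals overlap when each one starts before the other one ends.
co_occurring <- function(ann, i) {
  target <- ann[i, ]
  hit <- ann$channel != target$channel &
    ann$start < target$stop &
    ann$stop > target$start
  ann[hit, ]
}

# Query a single data point: row 2 is the "B" annotation in channel 1.
res <- co_occurring(ann, 2)
res                            # the co-occurring annotations and their values
table(res$channel, res$char)   # counts per channel and character

# "For every A value in channel 1, what overlaps in the other channels?"
a_rows <- which(ann$channel == 1 & ann$char == "A")
overlaps_for_A <- lapply(a_rows, function(i) co_occurring(ann, i))
```

Keeping all channels in one long table means the anchor channel is just a filter on `channel`, so any of the three can play that role without restructuring the data.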

This question is a bit exploratory in nature, and any and all answers, suggestions, or feedback on either question would be most appreciated.

Not sure about question 2 off-hand, but for question 1...

I think the current format is fine, though in R you will likely want it in one data frame (instead of three) where the dataset name (and/or its number, extracted) is a column. For instance, if your file above is in file1.csv, then the others might be in file2.csv and file3.csv. Reading and combining them can be done with:

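# Read every CSV in the working directory into a list named by file name
allfiles <- lapply(setNames(nm = list.files(pattern = "\\.csv$")), read.csv)
# Stack the frames; .id stores each list name in a "dataset" column
alldat <- dplyr::bind_rows(allfiles, .id = "dataset")
alldat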
#      dataset start.time stop.time char
# 1  file1.csv      0.000     9.719    A
# 2  file1.csv      9.719    11.735    B
# 3  file1.csv     11.735    14.183    A
# 4  file1.csv     14.183    16.554    C
# 5  file1.csv     16.554    18.482    A
# 6  file1.csv     18.482    19.553    B
# 7  file2.csv      0.000    11.693    A
# 8  file2.csv     11.693    12.310    B
# 9  file2.csv     12.310    13.912    A
# 10 file2.csv     13.912    15.406    C
# 11 file2.csv     15.406    16.988    A
# 12 file2.csv     16.988    19.553    B
# 13 file3.csv      0.000     7.777    A
# 14 file3.csv      7.777    12.920    B
# 15 file3.csv     12.920    15.449    A
# 16 file3.csv     15.449    15.920    C
# 17 file3.csv     15.920    20.042    A
# 18 file3.csv     20.042    19.553    B

(I jittered the first dataset into files 2 and 3.)
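To experiment without CSV files on disk, the same long-format structure can be built inline. The sketch below is only a stand-in for the file-reading step: the values are copied from the file1.csv and file2.csv rows of the output above, and the base R `rbind` approach is an assumed substitute for `dplyr::bind_rows(..., .id = "dataset")`, not part of the original answer.

```r
# Rows copied from the file1.csv and file2.csv portions of the printed output.
file1 <- data.frame(
  start.time = c(0, 9.719, 11.735, 14.183, 16.554, 18.482),
  stop.time  = c(9.719, 11.735, 14.183, 16.554, 18.482, 19.553),
  char       = c("A", "B", "A", "C", "A", "B")
)
file2 <- data.frame(
  start.time = c(0, 11.693, 12.310, 13.912, 15.406, 16.988),
  stop.time  = c(11.693, 12.310, 13.912, 15.406, 16.988, 19.553),
  char       = c("A", "B", "A", "C", "A", "B")
)

# Tag each frame with its source name, then stack them into one long frame.
frames <- list(file1.csv = file1, file2.csv = file2)
tagged <- Map(function(d, nm) cbind(dataset = nm, d), frames, names(frames))
alldat <- do.call(rbind, tagged)
rownames(alldat) <- NULL   # drop the "file1.csv.1"-style row names

str(alldat)
```

Each row now carries its own start time, stop time, character, and source channel, which is the binding the question asks about; no nested vectors are needed.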

From here, plotting with ggplot2 is not too difficult:

library(ggplot2)
ggplot(alldat, aes(fill = char, color = char)) +
  geom_rect(aes(xmin = start.time, xmax = stop.time, ymin = -0.5, ymax = 0.5)) +
  geom_text(aes(x = pmin(start.time, (start.time+stop.time)/2),
                y = 0, label = char),
            hjust = -0.5, vjust = 0.5,
            inherit.aes = FALSE) +
  scale_x_continuous(name = "Time (min)") +
  facet_grid(dataset ~ .) +
  theme(axis.text.y=element_blank(),
        axis.ticks.y=element_blank() )

[Figure: ggplot2 output using geom_rect]

The plot could be improved by fine-tuning hjust= (horizontal justification; i.e., -0.5 shifts the letters half a letter to the right of the start.time value) in the narrow bands. Other areas of improvement are mostly addressed by theme(..), e.g., removing the y-axis minor grid lines in the background, limiting the x-axis expansion, and placing (or removing) the legend, all of which are standard ggplot2 operations and should be easy enough to research and apply.
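A minimal sketch of those refinements follows, assuming a small toy frame (`toy`, hypothetical) with the same column names as alldat. The particular settings chosen here (legend at the bottom, no x-axis padding, no horizontal grid lines) are just one possible combination.

```r
library(ggplot2)

# Hypothetical toy data with the same column names as alldat
toy <- data.frame(
  dataset    = rep(c("file1.csv", "file2.csv"), each = 3),
  start.time = c(0, 9.719, 11.735, 0, 11.693, 12.310),
  stop.time  = c(9.719, 11.735, 14.183, 11.693, 12.310, 13.912),
  char       = c("A", "B", "A", "A", "B", "A")
)

p <- ggplot(toy, aes(fill = char, color = char)) +
  geom_rect(aes(xmin = start.time, xmax = stop.time, ymin = -0.5, ymax = 0.5)) +
  # expand = c(0, 0) removes the default padding at both ends of the x axis
  scale_x_continuous(name = "Time (min)", expand = c(0, 0)) +
  facet_grid(dataset ~ .) +
  theme(
    axis.text.y        = element_blank(),
    axis.ticks.y       = element_blank(),
    panel.grid.major.y = element_blank(),  # drop horizontal grid lines
    panel.grid.minor.y = element_blank(),
    legend.position    = "bottom"          # or "none" to remove the legend
  )
p
```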
