Coding matrix with overlap counts in R

Question

I am proficient in Python but a complete novice in R. I can't find an answer to this question anywhere else online, and whilst it's going to be a bit lengthy, I am hoping it will be useful to other users of the R library RQDA.

Essentially, RQDA is a qualitative research tool that is primarily used for assigning codes (themes) to text files. It's a bit like a highlighter pen that keeps count of where it has highlighted.

If you load a lot of files, you can code the text in different places with themes (e.g. for a project about interviewing people working in cloth manufacturing, the codes might be "equipment", "sewing", "linen", "silk", "lighting", "lunch breaks", etc.). This enables you to count how many times different codes were used, and RQDA gives a table output as follows:

    rowid  cid  fid  codename  filename    index1  index2  CodingLength
1   1      12   1    silk      2010-01-28  409     939     530
2   2      21   1    cotton    2010-01-28  1008    1172    164
3   3      12   1    silk      2010-01-28  1173    1924    751
4   4      39   1    sewing    2010-01-28  1008    1250    751
5   5      38   1    weaving   2010-01-28  1173    1924    751
6   6      78   1    costs     2010-01-28  727     939     212
7   7      23   1    lunch     2010-01-28  1553    1788    235
8   9      7    2    lunch     2010-01-29  1001    1230    371
9   10     4    2    weaving   2010-01-29  1547    1724    135
10  11     6    2    social    2010-01-29  1001    1290    350
11  12     7    2    silk      2010-01-29  1926    2276    350
12  14     17   2    supply    2010-01-29  1926    2276    350
13  15     78   2    costs     2010-01-29  1926    2276    350
14  17     78   2    weaving   2010-01-29  1890    2106    212
14  17  78  2   weaving 2010-01-29  1890    2106    212

codename = code (theme) assigned to the text

filename = filename of the text (in this case, the date of the diary entry)

index1 = character position in the file where the coded (highlighted) text starts

index2 = character position in the file where the coded (highlighted) text ends

CodingLength = overall length of the coded/highlighted text

What I'd like to do is iterate over the entire table (around 1,500 rows) with the full list of codes (codename in the table above, around 100 unique codes) in order to output a two-way matrix of overlap between codes, for example (indicative only, with 6 codes):

              silk  cotton  sewing  weaving  lunch breaks  socialising
silk          *     0       0       3        2             0
cotton        0     *       5       0        0             0
sewing        0     5       *       0        0             0
weaving       3     0       0       *        0             0
lunch breaks  2     0       0       0        *             5
socialising   0     0       0       0        5             *

(The formatting of this output got a bit messed up, but hopefully you get the idea.)

Therefore, in R I need a bit of code that will iterate over the code list and count the number of instances where A) filename is the same and B) there is overlap in the range between index1 and index2 (CodingLength is probably not important).

Apart from the following vague hunches, I am lost as to exactly how to make this work:

I probably need to assign the table to a variable, e.g.:
coding_table <- getCodingTable()
I probably need to make a list of the unique codes, e.g.:
x = c("silk","cotton","weaving","sewing","lunch" ... etc. )
I need a function that does the checks
I need a for-loop over the rows
I need a boolean test where the range and file name are checked, e.g. any(409:939 %in% 727:939) && filename == filename
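
The boolean test in the last point does not need to enumerate every character position. A minimal sketch of the same check as a comparison of range endpoints, where ranges_overlap is a hypothetical helper name (not part of RQDA) and the ranges are assumed to be closed, i.e. both endpoints are included:

```r
# Hypothetical helper: TRUE when the closed ranges [s1, e1] and [s2, e2]
# share at least one character position.
ranges_overlap <- function(s1, e1, s2, e2) {
  s1 <= e2 & s2 <= e1
}

# silk (409-939) and costs (727-939) from the first file in the table above
ranges_overlap(409, 939, 727, 939)    # TRUE
any(409:939 %in% 727:939)             # TRUE, the enumeration-based test

# silk (409-939) and cotton (1008-1172) do not overlap
ranges_overlap(409, 939, 1008, 1172)  # FALSE
```

Because the comparison uses & rather than &&, it is vectorised and can be applied to whole columns at once.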

Based on this, can anyone see a way to produce a very short solution to this? I feel like the equivalent in Python would be 10 lines maximum, but given the extra bits required in R, I am completely lost as to how to do this.

Answer 1

You can use the foverlaps function in the data.table package to create an edge list and then turn this into a weighted adjacency matrix.

Using a combination of data.table, dplyr, and igraph, I think this gets you what you want (I can't verify it without data, though).

First, you convert your data frame to a data table and set the key on filename, index1 and index2. Then, foverlaps identifies entries within the same file whose ranges from index1 to index2 overlap. After eliminating self-overlaps, replace the ids generated by foverlaps with the corresponding codenames from the data set. This creates an edge list. Pass this edge list to igraph to create an igraph object and return it as an adjacency matrix.

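# data.table provides foverlaps(), dplyr provides %>%, igraph builds the matrix
library(igraph); library(data.table); library(dplyr)

# Key on filename, index1, index2 so that ranges are only matched within a file
el <- setkey(setDT(coding_table), filename, index1, index2) %>%
  # which=TRUE returns row-index pairs (xid, yid) of overlapping ranges
  foverlaps(., ., type="any", which=TRUE) %>%
  # drop pairs with the same codename (this includes every row's self-overlap)
  .[coding_table$codename[xid] != coding_table$codename[yid]] %>%
  # replace the row indices with the corresponding codenames
  .[, `:=`(xid = coding_table$codename[xid], yid = coding_table$codename[yid])]

# Build a graph from the edge list and return its adjacency matrix
m <- as.matrix(get.adjacency(graph.data.frame(el)))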

Of course, dplyr is entirely optional; the piping just makes the code a bit neater and avoids creating more objects in the environment.
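
The step from an edge list to a weighted adjacency matrix can also be illustrated without igraph. A minimal base R sketch, assuming a made-up edge list el_demo with columns from and to (illustrative values, not the output of the code above), in which each overlapping pair appears once in each direction:

```r
# Hypothetical edge list: one row per overlapping pair, in both directions.
el_demo <- data.frame(
  from = c("silk", "weaving", "silk", "weaving", "cotton", "sewing"),
  to   = c("weaving", "silk", "weaving", "silk", "sewing", "cotton"),
  stringsAsFactors = FALSE
)

# Use one shared set of levels so that rows and columns line up.
codes <- sort(unique(c(el_demo$from, el_demo$to)))

# Cross-tabulating the two columns counts how often each pair occurs,
# which is the weight of that edge in the adjacency matrix.
m_demo <- table(factor(el_demo$from, levels = codes),
                factor(el_demo$to,   levels = codes))

m_demo["silk", "weaving"]   # 2
m_demo["cotton", "sewing"]  # 1
```

Since every pair is listed in both directions, the resulting matrix is symmetric.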

Answer 2

Here is another approach that seems valid, as I understand your description.

Find overlaps using the "IRanges" package:

library(IRanges)  # Bioconductor package; dat is the coding table
fo = findOverlaps(IRanges(dat$index1, dat$index2))

Check whether the overlapped ranges belong to the same "filename":

i = dat$filename[queryHits(fo)] == dat$filename[subjectHits(fo)]

And tabulate the "codename" for the overlapped "index1" and "index2" ranges belonging to the same "filename":

table(dat$codename[queryHits(fo)[i]], dat$codename[subjectHits(fo)[i]])
#       
#          costs cotton lunch sewing silk social supply weaving
#  costs       2      0     0      0    2      0      1       1
#  cotton      0      1     0      1    0      0      0       0
#  lunch       0      0     2      0    1      1      0       1
#  sewing      0      1     0      1    1      0      0       1
#  silk        2      0     1      1    3      0      1       2
#  social      0      0     1      0    0      1      0       0
#  supply      1      0     0      0    1      0      1       1
#  weaving     1      0     1      1    2      0      1       3
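
The table above can be cross-checked without IRanges. A minimal base R sketch, assuming closed ranges and using a data frame dat_demo (a name chosen here for illustration) transcribed from the 14 sample rows in the question; it compares all row pairs with outer(), which is fine for about 1,500 rows but scales quadratically:

```r
# Sample rows from the question (only the columns needed here).
dat_demo <- data.frame(
  codename = c("silk", "cotton", "silk", "sewing", "weaving", "costs", "lunch",
               "lunch", "weaving", "social", "silk", "supply", "costs", "weaving"),
  filename = rep(c("2010-01-28", "2010-01-29"), each = 7),
  index1 = c(409, 1008, 1173, 1008, 1173, 727, 1553,
             1001, 1547, 1001, 1926, 1926, 1926, 1890),
  index2 = c(939, 1172, 1924, 1250, 1924, 939, 1788,
             1230, 1724, 1290, 2276, 2276, 2276, 2106),
  stringsAsFactors = FALSE
)

# Logical matrices over all row pairs (i, j).
same_file <- outer(dat_demo$filename, dat_demo$filename, "==")
overlap   <- outer(dat_demo$index1, dat_demo$index2, "<=") &
             outer(dat_demo$index2, dat_demo$index1, ">=")

# Row/column positions of the pairs that satisfy both conditions.
hits <- which(same_file & overlap, arr.ind = TRUE)

# Cross-tabulate the codenames of each overlapping pair.
m_base <- table(dat_demo$codename[hits[, "row"]],
                dat_demo$codename[hits[, "col"]])

m_base["silk", "weaving"]  # 2, as in the table above

# Optional: blank out the diagonal, which mostly counts each row
# overlapping itself, to get closer to the layout asked for in the question.
m_off <- m_base
diag(m_off) <- 0
```

Note that zeroing the diagonal would also discard genuine overlaps between two different segments carrying the same code, should the data contain any.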


Answer 1
2 Accepted 2016-11-07 19:50:03

Answer 2
2 2016-11-08 14:57:30
