
Counting the number of strings preceded by other strings in R

I have a list of txt data files. Each of them is a record of all the actions a participant took in a set of tasks. An example piece of data for one task is:

[245] "2015-02-20 11:11:02|    134602| end of mat task (passed: 4/5)"                                                                                    
[246] "2015-02-20 11:11:02|    134599| step E9 abandoned - skipping to next"                                                                             
[247] "2015-02-20 11:11:01|    133596| step E9 bad choice - error limit reached"                                                                         
[248] "2015-02-20 11:10:47|    120007| intruder D started"                                                                                               
[249] "2015-02-20 11:10:47|    119792| step E9 bad choice"                                                                                               
[250] "2015-02-20 11:10:38|    110857| step E9 started"                                                                                                  
[251] "2015-02-20 11:10:37|    109844| step E1 success"                                                                                                  
[252] "2015-02-20 11:10:28|    101030| step E1 started"                                                                                                  
[253] "2015-02-20 11:10:27|    100018| step D10 success"                                                                                                 
[254] "2015-02-20 11:10:07|     79625| step D10 started"                                                                                                 
[255] "2015-02-20 11:10:06|     78609| step C12 success"                                                                                                 
[256] "2015-02-20 11:10:02|     74713| step C12 bad choice"                                                                                              
[257] "2015-02-20 11:09:50|     62673| step C12 started"                                                                                                 
[258] "2015-02-20 11:09:49|     61642| step B8 success"                                                                                                  
[259] "2015-02-20 11:09:47|     60003| intruder B started"                                                                                               
[260] "2015-02-20 11:09:33|     46047| step B8 started"                                                                                                  
[261] "2015-02-20 11:09:33|     46032| mats: B8,C12,D10,E1,E9"                                                                                           
[262] "2015-02-20 11:09:33|     46032| mat task: B8,C12,D10,E1,E9 displayed..."  

Now, for each element of my list, I need to count the number of times the "bad choice" message appears directly below a "success" message, i.e. the number of times a person made a mistake and then successfully corrected it (the data is saved from bottom to top, so newer events are above older ones).

Secondly, there are some intruder tasks in the procedure that are activated at random, so a message starting with "intruder..." (e.g. "intruder B started") might appear between a "bad choice" and a "success" message (this is not the case in the example above, but it can happen in the data). So I also need to count the instances in which an "intruder..." message (but no other message) appears between the two messages in question.

I would appreciate any tips on the best way to handle this problem.

Here is an example with some dummy data. It should give you an idea for the first part of your question.

lines <- c("2015-02-20 11:11:02|    134602| end of mat task (passed: 4/5)",
           "2015-02-20 11:11:02|    134599| step E9 abandoned - skipping to next",
           "2015-02-20 11:11:01|    133596| step E9 bad choice - error limit reached",
           "2015-02-20 11:10:38|    110857| step E9 started",
           "2015-02-20 11:10:37|    109844| step E1 success",
           "2015-02-20 11:10:02|     74713| step C12 bad choice")
# Take the line directly below each "success" line and keep those containing "bad choice"
grep('bad choice', lines[grep('success', lines) + 1], value=TRUE)
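
Since the question asks for a count rather than the matching lines, the same idea can be turned into a number. The following is only a minimal sketch on made-up dummy lines; the names below_success and n_corrected are illustrative and do not come from the original post.

```r
# Dummy log, newest event first (same layout as the data in the question).
lines <- c("2015-02-20 11:10:37|    109844| step E1 success",
           "2015-02-20 11:10:28|    101030| step E1 started",
           "2015-02-20 11:10:06|     78609| step C12 success",
           "2015-02-20 11:10:02|     74713| step C12 bad choice",
           "2015-02-20 11:09:50|     62673| step C12 started")

# The line directly below each "success" line, i.e. the event just before it.
# If "success" is the last line, the index runs past the end and yields NA.
below_success <- lines[grep("success", lines) + 1]

# grepl() returns FALSE for NA, so sum() counts only real "bad choice" lines.
n_corrected <- sum(grepl("bad choice", below_success))
n_corrected
```

Using grepl() with sum() instead of grep(..., value=TRUE) keeps the result a single number even when no "success" line is present.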

The second part can be handled in a similar way: split the one-liner into several steps, check whether the line below "success" is an "intruder" message, and if so, shift the index by 1.
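
One way to read this index-shifting suggestion is sketched below. It is an assumption-laden illustration, not the answerer's code: the intruder line's timestamp is invented, the pattern "\\| intruder" assumes the message text always follows "| ", and the variable names are hypothetical.

```r
# Dummy log, newest first, with an intruder message between success and bad choice.
lines <- c("2015-02-20 11:10:06|     78609| step C12 success",
           "2015-02-20 11:10:04|     76000| intruder B started",
           "2015-02-20 11:10:02|     74713| step C12 bad choice",
           "2015-02-20 11:09:50|     62673| step C12 started")

is_intruder <- grepl("\\| intruder", lines)  # message starts with "intruder"
is_bad      <- grepl("bad choice", lines)

n_corrected <- 0
for (i in grep("success", lines)) {
  j <- i + 1
  # Skip any intruder lines sitting directly below the success line.
  while (j <= length(lines) && is_intruder[j]) j <- j + 1
  # Count the success only if the first non-intruder line is a bad choice.
  if (j <= length(lines) && is_bad[j]) n_corrected <- n_corrected + 1
}
n_corrected
```

The while loop also covers the case of several consecutive intruder messages, which a fixed offset of 1 would miss.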

As antoine-sac suggested in the comments, you can remove the intruder messages up front with:

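# grepl() takes the pattern first; no "^" anchor, because every line starts with a timestamp
tmp <- lines[!grepl("\\| intruder", lines)]
# With intruder lines gone, "bad choice" sits directly below "success" again
grep('bad choice', tmp[grep('success', tmp) + 1], value=TRUE)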
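
To get one count per file, as the question asks, the filter-then-match steps could be wrapped in a function and applied over the list. This is a rough sketch under assumptions: count_corrected and logs are hypothetical names, and logs stands in for a list of character vectors such as one built with lapply(files, readLines).

```r
# Hypothetical helper: number of "bad choice" events directly followed by a
# "success", ignoring intruder messages in between (log is newest first).
count_corrected <- function(x) {
  x <- x[!grepl("\\| intruder", x)]               # drop intruder messages
  sum(grepl("bad choice", x[grep("success", x) + 1]))
}

# Dummy stand-in for the list of files, one character vector per participant.
logs <- list(
  p1 = c("2015-02-20 11:10:06|     78609| step C12 success",
         "2015-02-20 11:10:04|     76000| intruder B started",
         "2015-02-20 11:10:02|     74713| step C12 bad choice",
         "2015-02-20 11:09:50|     62673| step C12 started"),
  p2 = c("2015-02-20 11:09:49|     61642| step B8 success",
         "2015-02-20 11:09:33|     46047| step B8 started")
)

# One integer per list element; vapply() checks the result type for us.
counts <- vapply(logs, count_corrected, integer(1))
counts
```

vapply() is used rather than sapply() so that an unexpected return value raises an error instead of silently changing the result's shape.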

