Matching regexes

Question

I have this code:

library(stringr)
library(devtools)

full_patterns <- source_gist("446417161352179ce42c")$value
literal_strings <- source_gist("21f5cf342e20c6e4a1e8")$value
literal_strings <- literal_strings[order(nchar(literal_strings), decreasing = TRUE)]

regex_list <- list()
for (i in 1:length(literal_strings)){
  regex_list[i] <- paste0("(?<=", literal_strings[i], "?)(?:I\\d-?)*I3(?:-?I\\d)*")
}

IVs_identified <- list()
DVs_identified <- list()

for (i in 1:length(regex_list)){
  DVs_identified[[i]] <- lapply(full_patterns, str_extract_all, regex_list[[i]])
  IVs_identified[[i]] <- lapply(full_patterns, str_extract_all, literal_strings[[i]])
}

data.frame(unlist(DVs_identified), unlist(IVs_identified))

length(unlist(DVs_identified))
length(unlist(IVs_identified))

The point of the code is to generate a data.frame with two columns. The first column should contain the first part of the regex match (the literal taken from literal_strings). The second column should contain the second part of the regex match, i.e. (?:I\\d-?)*I3(?:-?I\\d)*, but only if it is preceded by the appropriate literal string. The second part of the regex matches the specifications described here. In short: it is an uninterrupted sequence of markers (i.e. I1, I2, and I3) that contains only IX markers and in which I3 occurs at least once. In other words, markers such as FA do not occur inside this sequence.

To make this work, the line literal_strings <- literal_strings[order(nchar(literal_strings), decreasing = TRUE)] is crucial. It orders the literal strings so that the longer strings come first. The intention is that once a section of full_patterns has been matched, it should be ignored from then on. For example, the longest literal string is IFA-NR-TR-TR-FA,TR-NR-FA-NR-NR-QU-QU-NR-IFA-EX-TR-NR-FA-QU-I2-EX-II2-NR-TR-TR-I2-EX-NR-QU-EX-I2,QU-TR-NR-QU-NR-FA-TR-QU-EX-II2-I2-I2-I2-II2-FA-EX-TR-TR-QU-NR-NR-NR-TR-I2-FA-QU-ITR-EX-FA,TR-I2-NR-QU-FA-IFA-TR-EX-NR-FA-NR-FA-EX-FA-FA-QU-NR-NR-NR-INR-TR and one of the shortest is FA. By the time the short strings are processed (towards the end of the loop), we are not interested in matching the single FA markers that were already matched inside of previous literal strings.
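To illustrate the "longest first, then ignore what was already matched" idea, here is a minimal sketch. It is not part of the code above: the toy text and the two literals are made up, and overwriting matched regions with a # filler is only one assumed way of making later, shorter literals skip them. It uses base R with fixed = TRUE so the literals need no escaping.

```r
# Toy inputs, made up for illustration only.
text <- "FA-QU-I2-I3-NR-FA-I3"
literals <- c("FA-", "FA-QU-")

# Longest literal first, as in the question.
literals <- literals[order(nchar(literals), decreasing = TRUE)]

counts <- integer(0)
for (lit in literals) {
  # Count occurrences of the literal in the (possibly already masked) text.
  hits <- gregexpr(lit, text, fixed = TRUE)[[1]]
  counts[lit] <- sum(hits > 0)
  # Overwrite matched regions with same-length filler so that
  # shorter literals processed later cannot match inside them.
  text <- gsub(lit, strrep("#", nchar(lit)), text, fixed = TRUE)
}

print(counts)
print(text)
```

Without the masking step, FA- would be counted twice here, once inside the longer FA-QU- literal. Note that masking also removes the context a lookbehind would need, so this only sketches the counting side of the problem.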

As you can see, the code doesn't work because the two lists that are generated are of different lengths - they need to be of the exact same length. How can I accomplish this?

For debugging (since running this on R 3.1.2 does not seem to work): My sessionInfo() gives:

R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringr_1.0.0

loaded via a namespace (and not attached):
[1] magrittr_1.5  tools_3.2.0   stringi_0.4-1

Answer 1

Take a look at this:

library(stringr)
library(devtools)
library(Hmisc)  # provides escapeRegex()


full_patterns <- c("I2-EX-I3-EX-I2-IEX-I3-I2-EX-I2-I2-II3-I2-III2-I2-I3-INR-FA-NR-I3-INR-IEX-QU-I3-NR-FA-EX-QU-NR-I2-I2-I2-NR-TR-II2-I3-NR-IIEX")
#full_patterns <- c("I2-EX-I3-EX-I2-IEX-I3-I2-EX-I2-I2-II3-I2-III2-I2-I3-INR-FA-NR-I3-INR-IEX-QU-I3-NR-FA-EX-QU-NR-I2-I2-I2-NR-TR-II2-I3-NR-IIEX-NR-NR-INR-NR-I3-I2-NR-IQU-QU-ITR-QU-NR-NR-QU-TR-NR-ITR-IFA-II2-QU-TR-FA-EX-QU-QU-QU-NR-QU-ITR-FA-QU-FA-FA-TR-FA-QU-EX-QU-IQU-QU-FA-FA-QU-QU-FA-FA-I3-NR-FA-II2-FA-QU-FA-I2-FA-NR-INR-TR-NR-EX-NR-NR-EX-TR-I3-INR-NR-FA-ITR-EX-NR-NR-IINR-INR-EX-EX-EX-NR-NR-NR-FA-FA", "FA-I2-I2-I2-EX-I2-I3-FA-II2-TR-II2-FA-I3-IFA-FA-NR-I3-I2-TR-II2-II2-FA-I2-II3-FA-QU-II2-I2-I2-NR-I2-I2-NR-II2-INR-I3-QU-I2-I3-QU-NR-I2-INR-QU-QU-I2-IEX", "FA-FA-ITR-IIFA,TR-FA-I2-I2-FA-EX-IFA,IEX,I2-I2-INR-I2-I3-I1,TR-NR-I2-I3-EX-IQU-TR-I3-NR-EX-I3-EX,I2-EX-IIIII2-II3-I2-EX,FA-IEX-EX-TR-EX-TR-I3-INR-I2-FA-FA-TR-I2-IIIIIFA-I2-FA-TR-III3-NR-FA-III3-TR-I2-I2,I2-I2-EX,TR-TR-I2-FA-I2-I3-IIIFA-ITR-FA-IFA-INR-NR-II2-I3-I2-FA-II2-EX-FA,I3-I3-TR-I3-FA-NR-II2-II3-TR-TR-EX,I3-TR-NR-TR-QU-EX-NR-TR-I2-EX-III3-INR-INR-IFA,TR-I3-I2-I3-NR-NR-I1,IIFA-FA-IFA-FA-NR-II3-NR-I2-FA-FA-IFA-NR-FA,IFA-FA-NR-NR-I2-NR-IIIFA-EX,II2-II2-I2-QU-TR-FA-QU-I3-EX-ITR-IFA-FA-NR-INR-FA-FA-EX-II2-NR-I3,I3-FA-I2-I2-FA-I2-FA-I2,I2-INR-I2-NR-II3-TR-FA-I2-I3,I3-NR-EX-TR-IEX,II2-FA-I2-INR-I2-I3-IIEX-FA,IEX-EX-EX-EX-EX-EX-EX-TR-TR-I2-NR-NR-EX-NR-I3-FA-NR-NR-NR-EX-NR-II2-IIFA-FA-ITR-NR-I2-I3-I2-NR-FA-NR-I1")
literal_strings <- c("I2")
#literal_strings <- c("FA-QU-II2-I2-I2-NR-I2-I2-NR-II2-INR-", "QU-I2-", "QU-NR-I2-INR-QU-QU-I2-IEX-", "FA-", "QU-EX-NR-", "NR-EX-", "NR-EX-TR-", "QU-")
#full_patterns <- source_gist("446417161352179ce42c")$value
#literal_strings <- source_gist("21f5cf342e20c6e4a1e8")$value
escaped_literals <- lapply(literal_strings, escapeRegex)

regex_list <- list()
for (i in 1:length(literal_strings)){
  regex_list[i] <- paste0("(?:(?=", escapeRegex(literal_strings[i]), ")(?:I\\d-?)*I3(?:-?I\\d)*|(?=", escapeRegex(literal_strings[i]), "))")
}

IVs_identified <- list()
DVs_identified <- list()

for (i in 1:length(regex_list)){
  DVs_identified[[i]] <- lapply(full_patterns, str_extract_all, regex_list[[i]])
  IVs_identified[[i]] <- lapply(full_patterns, str_extract_all, escaped_literals[[i]])
}

unlistDVs <- unlist(DVs_identified)
unlistIVs <- unlist(IVs_identified)

for(i in 1:length(unlistDVs))
{
  print(unlistDVs[i])
  flush.console()
}

print("---------------------")

for(i in 1:length(unlistIVs))
{
  print(unlistIVs[i])
  flush.console()
}



data.frame(unlist(DVs_identified), unlist(IVs_identified))

print(length(unlist(DVs_identified)))
print(length(unlist(IVs_identified)))

I've stripped the data down in the sample above to isolate what (I believe) is causing the discrepancy. The reason this is not working should become obvious when you run it. In the small sample I've set up, the sixth I2 gets matched, but because the regex correctly matches I2-I2-I3 from that position, it skips over a literal_string match: there are two I2's inside one legal regex match. This is just one example, but it is easy to see the same thing occurring in other cases.
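The effect can be reproduced on an even smaller string. The sketch below is an illustration rather than the original data: the input x is made up, and it uses base R's gregexpr() with perl = TRUE instead of stringr purely to stay dependency-free. The regex is the one built in the loop above for the literal I2.

```r
# Made-up input containing two I2 markers followed by an I3.
x <- "EX-I2-I2-I3-NR"

# Every occurrence of the literal on its own.
literal_hits <- regmatches(x, gregexpr("I2", x, fixed = TRUE))[[1]]

# The regex generated above for literal_strings = "I2".
answer_regex <- "(?:(?=I2)(?:I\\d-?)*I3(?:-?I\\d)*|(?=I2))"
sequence_hits <- regmatches(x, gregexpr(answer_regex, x, perl = TRUE))[[1]]

print(literal_hits)   # two separate I2 matches
print(sequence_hits)  # a single match, I2-I2-I3, covering both of them
```

The literal is found twice, but the sequence regex consumes both occurrences in one match, so the two result vectors end up with different lengths.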

I think the way I've structured the regexes is correct. The issue is that the optional part of the regex you've provided, (?:I\\d-?)*I3(?:-?I\\d)*, can sometimes span multiple literal_string matches, which causes the discrepancy. I've spent more time on this than is probably reasonable, so unless there is something I'm missing, I'm going to bow out.
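One possible way to get two columns of equal length, offered only as a sketch and not tested against the real gists, is to match the literal and the optional sequence in a single pass and then split each hit. The helper name pair_literal_with_sequence is hypothetical, the input is made up, and it assumes the sequence must directly follow the literal (as the lookbehind in the question suggests). \Q...\E quoting is used in place of escapeRegex() to avoid the Hmisc dependency.

```r
# Hypothetical helper: one row per occurrence of `literal`, paired with the
# I3 sequence that directly follows it (NA when no such sequence follows).
pair_literal_with_sequence <- function(text, literal) {
  # \Q...\E makes PCRE treat the literal verbatim
  # (assumes the literal itself does not contain "\E").
  pattern <- paste0("\\Q", literal, "\\E(?:(?:I\\d-?)*I3(?:-?I\\d)*)?")
  hits <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
  # Everything after the literal is the sequence part of the hit.
  dv <- substring(hits, nchar(literal) + 1)
  dv[dv == ""] <- NA_character_
  data.frame(IV = rep(literal, length(hits)), DV = dv,
             stringsAsFactors = FALSE)
}

# Made-up input: three FA- occurrences, two followed by an I3 sequence.
pairs <- pair_literal_with_sequence("FA-I2-I3-NR-FA-EX-FA-I3-I2", "FA-")
print(pairs)
```

Because both columns are derived from the same vector of hits, they cannot differ in length. This sketch does not handle the longest-first masking described in the question, and a literal such as FA- would still match inside IFA-, so it would need adapting before use on the full data.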
