
Matching of patterns in R

I have a data.frame with two rows and 20 columns where each column holds one character, which roughly looks like this (columns scrunched here for clarity):

        Cols 1-20
  row1  ghuytuthjilujshdftgu 
  row2  ghuytuthjilujshdftgu

I want a mechanism for comparing these two strings character by character (column by column), starting from position 10 and scanning outwards in both directions, that returns the number of matching characters before the first difference is encountered on each side. In this case both rows are identical, so the answer would be 20. Importantly, even when the rows are completely identical, as above, there should be no error; the full length should be returned.

With this alternate example, the answer should be 12:

        Cols 1-20
  row1  ghuytuthjilujshdftgu
  row2  XXXXXXXXjilujshdftgu

Here is some code to generate the data frames:

# First example: identical rows
r1 <- "ghuytuthjilujshdftgu"
r2 <- "ghuytuthjilujshdftgu"
df1 <- as.data.frame(rbind(unlist(strsplit(r1, "")), unlist(strsplit(r2, ""))))

# Second example: rows differ in the first 8 positions
r1 <- "ghuytuthjilujshdftgu"
r2 <- "XXXXXXXXjilujshdftgu"
df2 <- as.data.frame(rbind(unlist(strsplit(r1, "")), unlist(strsplit(r2, ""))))

Edit:

The class of the object is data.frame and it is subsettable, with dim = 2, 20 (each column, i.e. each character, is accessible on its own).

Here is an answer that splits the data frame into two pieces, left and right of the center column, with the left piece reversed so that both are read from the center outwards. Each piece is then reduced with cumsum over a vector that is 1 for a match and NA for a mismatch, so the running total turns to NA at the first mismatch and stays NA. The value at the highest non-NA index is the length of the longest stretch starting from the center without a mismatch.

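sim_len <- function(df, center=floor(ncol(df) / 2)) {
  # Left piece runs from center down to column 1; right piece from center to the last column
  dfs <- list(df[, max(center, 1):1, drop=FALSE], df[, center:ncol(df), drop=FALSE])
  df.count <- lapply(dfs, function(df) {
    # Running count of matches; becomes NA from the first mismatch onwards
    diff <- cumsum(ifelse(df[1, ] == df[2, ], 1, NA_integer_))
    diff[max(which(!is.na(diff)))]
  })
  # The center column is counted in both pieces, so subtract 1
  max(0L, sum(unlist(df.count)) - 1L)
}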
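
To see the cumsum/NA step in isolation, here is a minimal sketch on a made-up logical vector (match_flags is a hypothetical stand-in for the result of df[1, ] == df[2, ]; it is not part of the function above):

# Hypothetical comparison result: two matches, then a mismatch, then a match
match_flags <- c(TRUE, TRUE, FALSE, TRUE)
# 1 for a match, NA for a mismatch; cumsum propagates NA once it appears
run <- cumsum(ifelse(match_flags, 1, NA_integer_))
run
# [1]  1  2 NA NA
# Value at the last non-NA index = length of the initial matching stretch
run[max(which(!is.na(run)))]
# [1] 2

The match after the mismatch is ignored, because everything after the first NA stays NA.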

And here are some examples of running sim_len (the as.data.frame business is just creating the data frame from the character strings). Note that the "center" column is counted twice, hence the -1L in the final line of the function.

r1 <- "ghuytuthjilujshdftgu"
r2 <- "ghuytuthjilujshdftgu"
df1 <- as.data.frame(rbind(unlist(strsplit(r1, "")), unlist(strsplit(r2, ""))))
sim_len(df1)
# [1] 20

r1 <- "ghuytut3jilujshdftgu"
r2 <- "ghuytuthjilujshdftgu"
df2 <- as.data.frame(rbind(unlist(strsplit(r1, "")), unlist(strsplit(r2, ""))))
sim_len(df2)
# [1] 12

r1 <- "ghuytut3jilujshdftgu"
r2 <- "ghuytuthjilujxhdftgu"
df3 <- as.data.frame(rbind(unlist(strsplit(r1, "")), unlist(strsplit(r2, ""))))
sim_len(df3)
# [1] 5

r1 <- "ghuytut3xilujshdftgu"
r2 <- "ghuytuthjixujxhdftgu"
df4 <- as.data.frame(rbind(unlist(strsplit(r1, "")), unlist(strsplit(r2, ""))))
sim_len(df4)
# [1] 1

A variation that reports the left and right counts separately. Note that the "center" column is counted in both left and right, so left + right is 1 greater than the value reported by the original function:

sim_len2 <- function(df, center=floor(ncol(df) / 2)) {
  dfs <- list(left=df[, max(center, 1):1, drop=FALSE], right=df[, center:ncol(df), drop=FALSE])
  vapply(dfs,
    function(df) {
      diff <- cumsum(ifelse(df[1, ] == df[2, ], 1, NA_integer_))
      diff[max(which(!is.na(diff)))]
    },
    numeric(1L)
  )
}
sim_len2(df1)
# left right 
#   10    11
sim_len2(df4, 4)
# left right 
#    4     4 
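
If the data happens to be available as two plain strings rather than a data frame, the same center-outwards idea could be sketched as below. This is only an illustration under that assumption: sim_len_chr is a hypothetical name, and it swaps the cumsum/NA trick for cumprod on the logical match vector, which stays 1 until the first mismatch and is 0 afterwards. It also returns 0 when the center position itself does not match.

# Hypothetical helper, not part of the answer above: works on two equal-length strings
sim_len_chr <- function(s1, s2, center = max(1, floor(nchar(s1) / 2))) {
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  stopifnot(length(a) == length(b))
  eq <- a == b                      # TRUE where the characters match
  if (!eq[center]) return(0L)       # mismatch at the center itself
  # cumprod is 1 up to the first FALSE and 0 from there on, so sum() counts the run
  left  <- sum(cumprod(eq[center:1]))
  right <- sum(cumprod(eq[center:length(eq)]))
  as.integer(left + right - 1)      # center is counted on both sides
}
sim_len_chr("ghuytuthjilujshdftgu", "ghuytuthjilujshdftgu")
# [1] 20
sim_len_chr("ghuytut3jilujshdftgu", "ghuytuthjilujshdftgu")
# [1] 12
sim_len_chr("ghuytut3jilujshdftgu", "ghuytuthjilujxhdftgu")
# [1] 5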
