
Efficient way to iterate over two lists (nested loop alternative)

I have two data frames, SCR and SpecificSpecies. Some of the item names in SCR contain, as part of the string, a species listed in SpecificSpecies.

SpecificSpecies$Species
S cerevisiae
Daucus carota

SCR$MESH_HEADINGS
tetracycline CMT-3 
zrg17 protein, S cerevisiae
EP4 glycoprotein, Daucus carota

I am trying to get the subset of SCR that contains just those entries which do not match any species. In the above case, that list would be just

tetracycline CMT-3.

The way I learned to do this would be using nested loops, comparing every entry in SCR to every entry in SpecificSpecies. When no match is found, append the row of SCR to a new table:

speciesNoMatch <- SCR[0, , drop = FALSE]  # empty result with the same columns as SCR
for (row in seq_len(nrow(SCR))) {
  SpeciesNumber <- 1
  match <- FALSE
  # Check species one at a time and stop at the first one found in this heading
  while (!match && SpeciesNumber <= length(SpecificSpecies$Species)) {
    if (grepl(SpecificSpecies$Species[SpeciesNumber], SCR$MESH_HEADINGS[row])) {
      match <- TRUE
    }
    SpeciesNumber <- SpeciesNumber + 1
  }
  # No species matched this heading, so keep the row
  if (!match) {
    speciesNoMatch <- rbind(speciesNoMatch, SCR[row, , drop = FALSE])
  }
}

But this is excruciatingly slow with 65,000 entries in SCR and about 1500 in SpecificSpecies. Is there a way to nest like this with lapply? Or some other function that will help here that I am unfamiliar with?

I'm sure this is terrible code to begin with. I'm a medical librarian who has to use R sometimes for data analysis, so I have very limited programming skills to make do, but usually it doesn't matter if my solutions are ugly or inefficient as long as they eventually work. I know there must be a better way to do this; forgive me for being ignorant of something that is probably a simple solution.

I think !(%in%) will do the trick:

SpecificSpecies <- data.frame(
  Species = c("S cerevisiae", "Daucus carota"),
  stringsAsFactors = FALSE
)

SCR <- data.frame(
  MESH_HEADINGS = c("tetracycline CMT-3", "zrg17 protein", "S cerevisiae", 
                    "EP4 glycoprotein", "Daucus carota"),
  stringsAsFactors = FALSE
)


SCR[!(SCR$MESH_HEADINGS %in% SpecificSpecies$Species), , drop = FALSE]
#        MESH_HEADINGS
# 1 tetracycline CMT-3
# 2      zrg17 protein
# 4   EP4 glycoprotein

The , , drop = ... isn't a typo. Leaving the column index empty between the two commas returns all columns/variables, and drop = FALSE ensures the returned result is still a data frame rather than a plain vector.
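To make the effect of drop = FALSE concrete, here is a minimal sketch on a one-column data frame. The names keep, as_vector and as_frame are made up for this illustration and are not part of the answer above.

```r
# A one-column data frame, shaped like SCR in the example above
SCR <- data.frame(
  MESH_HEADINGS = c("tetracycline CMT-3", "S cerevisiae"),
  stringsAsFactors = FALSE
)
keep <- !(SCR$MESH_HEADINGS %in% "S cerevisiae")

# drop defaults to TRUE, so a single remaining column collapses to a vector
as_vector <- SCR[keep, ]
# drop = FALSE keeps the data frame structure
as_frame <- SCR[keep, , drop = FALSE]

class(as_vector)  # "character"
class(as_frame)   # "data.frame"
```

This only matters when the selection leaves a single column; with two or more columns the result stays a data frame either way.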

Correction

OK, I've just noticed you're looking to grep for the Species inside the headings. The following code should work:

SpecificSpecies <- data.frame(
  Species = c("S cerevisiae", "Daucus carota"),
  stringsAsFactors = FALSE
)

SCR <- data.frame(
  MESH_HEADINGS = c("tetracycline CMT-3",
                    "zrg17 protein, S cerevisiae", 
                    "EP4 glycoprotein, Daucus carota"),
  stringsAsFactors = FALSE
)

matching <- lapply(SpecificSpecies$Species, function(x) {
  grep(x, SCR$MESH_HEADINGS)
})

SCR[-(unlist(matching)), , drop = FALSE]
#        MESH_HEADINGS
# 1 tetracycline CMT-3

The lapply() uses an anonymous function to identify pattern matches. It loops through every species and compares it to every SCR$MESH_HEADINGS item. It returns a list of matched indices.

The subset ([]) simply drops the matched indices (-) after we've first unlisted them to make them compatible with the subset operator.
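One caveat with negative indexing: if no species matches any heading, unlist(matching) is empty, and subsetting with an empty negative index returns zero rows instead of all of them. A possible variant, sketched here rather than taken from the answer, builds a logical mask with grepl() instead. The names hits, has_species and no_match are made up for this illustration, and fixed = TRUE is an added assumption that species names should be treated as literal text rather than regular expressions.

```r
SpecificSpecies <- data.frame(
  Species = c("S cerevisiae", "Daucus carota"),
  stringsAsFactors = FALSE
)
SCR <- data.frame(
  MESH_HEADINGS = c("tetracycline CMT-3",
                    "zrg17 protein, S cerevisiae",
                    "EP4 glycoprotein, Daucus carota"),
  stringsAsFactors = FALSE
)

# One logical vector per species: TRUE where the heading contains that species
hits <- lapply(SpecificSpecies$Species, function(x) {
  grepl(x, SCR$MESH_HEADINGS, fixed = TRUE)
})

# Combine with OR, starting from all-FALSE so an empty species list still works
has_species <- Reduce(`|`, hits, rep(FALSE, nrow(SCR)))

# Keep only the rows where no species was found
no_match <- SCR[!has_species, , drop = FALSE]
no_match
```

Because the mask is always as long as SCR, the "nothing matched" case simply keeps every row.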

Main idea:

Loop over SpecificSpecies, since it has fewer rows. Because SCR shrinks every time matching rows are removed, filter it iteratively so that each pass of the loop works on less data.

Packages such as data.table or plyr can often improve performance. Here is the solution with data.table:

library(data.table)

SpecificSpecies <- data.frame(Species = c("S cerevisiae", "Daucus carota"),
                              stringsAsFactors = FALSE)
SCR <- data.frame(
  MESH_HEADINGS = c("tetracycline CMT-3",
                    "zrg17 protein, S cerevisiae",
                    "EP4 glycoprotein, Daucus carota"),
  stringsAsFactors = FALSE
)

# Remove the rows matching each species in turn, so later passes scan fewer rows
dt_temp <- data.table(SCR)
for (species in SpecificSpecies$Species) {
  dt_temp <- dt_temp[!grepl(species, dt_temp$MESH_HEADINGS), ]
}
dt_result <- dt_temp
dt_result
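For readers without data.table, the same shrinking-loop idea can be sketched in base R. The helper drop_species below is hypothetical (it does not come from any package), and it assumes species names should be matched as literal text, hence fixed = TRUE. No timings are claimed for it; it only shows the technique.

```r
# Hypothetical helper: return the indices of `headings` that contain none of `species`
drop_species <- function(headings, species) {
  remaining <- seq_along(headings)  # indices with no match so far
  for (sp in species) {
    if (length(remaining) == 0) break  # nothing left to test
    # Only test the headings that have survived the earlier species
    hit <- grepl(sp, headings[remaining], fixed = TRUE)
    remaining <- remaining[!hit]
  }
  remaining
}

SCR <- data.frame(
  MESH_HEADINGS = c("tetracycline CMT-3",
                    "zrg17 protein, S cerevisiae",
                    "EP4 glycoprotein, Daucus carota"),
  stringsAsFactors = FALSE
)

idx <- drop_species(SCR$MESH_HEADINGS, c("S cerevisiae", "Daucus carota"))
SCR[idx, , drop = FALSE]
```

Tracking indices rather than copying the data frame on every pass keeps the loop cheap, and the original row numbers are preserved in the result.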


 