Faster way to run a for loop in R for this?

Question

So, this is what my dataframe looks like:

Product_Code      Publisher    Published_Date
AB1F                  A            2011
AB1F (A Version)      A            1999
TG1F (B Version)      B            2001
AB1Z (A Version)      A            2003
TG1F                  B            2006
GX1T                  C            2011

with about 1.3 million rows.

For rows with the same Publisher, I want to use grep() on Product_Code to find the rows that share a product code regardless of version, and then set Published_Date for all of them to the oldest (minimum) date in that group.

So the result will look like this:

Product_Code      Publisher    Published_Date
AB1F                  A            1999
AB1F (A Version)      A            1999
TG1F (B Version)      B            2001
AB1Z (A Version)      A            2003
TG1F                  B            2001
GX1T                  C            2011

I tried

for (n in 1:nrow(df)) {
   A=which(grepl(df[n,1],df[,1])==TRUE & df[n,2]==df[,2])
   min.date=min(df[A,3])
   df[A,3]=min.date
}

I am not sure if this for loop even works, because it never finishes running on my computer.

Any help will be appreciated!

Answer 1

We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT(df1). As the grouping variable, take 'Product_Code' with everything from the first whitespace onward removed using sub, so that all versions of a product fall into one group. Within each group, if any 'Product_Code' contains a ( character, extract the letter that follows the (, match 'A' and 'B' against those letters, drop the NAs, use the resulting positions to subset 'Published_Date', and take the min; otherwise keep 'Published_Date' as it is. The result is assigned ( := ) back to 'Published_Date'.

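library(data.table)
# Group by the base code (text before the first whitespace); := updates by reference
setDT(df1)[, Published_Date := if(any(grepl("\\(", Product_Code)))
  # oldest date among the first "A" and first "B" version rows in the group
  min(Published_Date[na.omit(match(c("A", "B"), sub(".*\\((.).*", "\\1", Product_Code)))])
   else Published_Date , by = .(grp=sub("\\s+.*", "", Product_Code))]
df1
#        Product_Code Publisher Published_Date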
#1:             AB1F         A           1999
#2: AB1F (A Version)         A           1999
#3: TG1F (B Version)         B           2001
#4: AB1Z (A Version)         A           2003
#5:             TG1F         B           2001
#6:             GX1T         C           2011
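For comparison, here is a minimal base R sketch of a simpler rule than the one above: group by the base code and by Publisher (as the question asks) and give every row the minimum date of its group. It is not the answer's method. It assumes the base code is always the text before the first whitespace, and the names base_code and key are illustrative only. The sample data is copied from the question.

```r
# Sample data from the question; df1 is the name used in the answer.
df1 <- data.frame(
  Product_Code = c("AB1F", "AB1F (A Version)", "TG1F (B Version)",
                   "AB1Z (A Version)", "TG1F", "GX1T"),
  Publisher = c("A", "A", "B", "A", "B", "C"),
  Published_Date = c(2011L, 1999L, 2001L, 2003L, 2006L, 2011L),
  stringsAsFactors = FALSE
)

# Assumption: the base code is everything before the first whitespace.
base_code <- sub("\\s+.*", "", df1$Product_Code)

# One key per (Publisher, base code) pair. A single key avoids the empty
# groups that ave() would otherwise build from unused factor combinations.
key <- paste(df1$Publisher, base_code)

# ave() applies min() within each group and returns a vector aligned
# with the original rows, so no explicit loop over rows is needed.
df1$Published_Date <- ave(df1$Published_Date, key, FUN = min)

print(df1)
```

Because the grouping is computed once for the whole column instead of running grepl() once per row, this should scale much better than the original loop, although it has not been timed on 1.3 million rows here.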

Or with dplyr and tidyr: we separate 'Product_Code' into two columns ("Product", "Version"), group by "Product", and mutate 'Published_Date' with an if/else condition.

library(dplyr)
library(tidyr)
df1 %>% 
    separate(Product_Code, into = c("Product", "Version"), remove=FALSE) %>%
    group_by(Product) %>% 
    mutate(Published_Date = if(all(is.na(Version))) Published_Date
          else min(Published_Date[Version == Publisher & !is.na(Version)])) %>%
    ungroup() %>%   
    select(-Product, -Version)
#      Product_Code Publisher Published_Date
#             <chr>     <chr>          <int>
#1             AB1F         A           1999
#2 AB1F (A Version)         A           1999
#3 TG1F (B Version)         B           2001
#4 AB1Z (A Version)         A           2003
#5             TG1F         B           2001
#6             GX1T         C           2011

Instead of separate, we can also use extract, which avoids the warning message.

df1 %>% 
   extract(Product_Code, into = c("Product", "Version"), 
                     "(\\S+)\\s*\\(*(\\S*).*", remove = FALSE) %>%
   group_by(Product) %>%
   mutate(Published_Date = if(all(!nzchar(Version))) Published_Date
      else min(Published_Date[Version == Publisher])) %>%
   ungroup() %>%
   select(-Product, -Version)
#     Product_Code Publisher Published_Date
#             <chr>     <chr>          <int>
#1             AB1F         A           1999
#2 AB1F (A Version)         A           1999
#3 TG1F (B Version)         B           2001
#4 AB1Z (A Version)         A           2003
#5             TG1F         B           2001
#6             GX1T         C           2011

Update

If the codes do not follow one specific pattern, we can insert a ( into the elements that have more than one word but no (

df1$Product_Code <- sub("\\s+\\(*", " (", df1$Product_Code)

and then use the code above.
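To see what that substitution does, here is a small check on three made-up codes (the vector codes is hypothetical, not taken from the question's data): one that already has a parenthesis, one with several words but no parenthesis, and one single-word code.

```r
# Hypothetical inputs, for illustration only.
codes <- c("AB1F (A Version)", "AB1F A Version", "GX1T")

# Replace the first run of whitespace, plus an optional "(", with " (".
normalised <- sub("\\s+\\(*", " (", codes)

print(normalised)
# [1] "AB1F (A Version)" "AB1F (A Version"  "GX1T"
```

Note that no closing ) is added to the second element. The patterns used earlier only look for the opening ( and the character after it, so that should not matter for them.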

solution1
3 ACCEPTED 2016-06-25 08:43:08