Faster way to run this for loop in R?

Question

This is what my data frame looks like:

Product_Code      Publisher    Published_Date
AB1F                  A            2011
AB1F (A Version)      A            1999
TG1F (B Version)      B            2001
AB1Z (A Version)      A            2003
TG1F                  B            2006
GX1T                  C            2011

with about 1.3 million rows.
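For anyone who wants to reproduce this, the six sample rows can be built as a small data frame. This is only a sketch: the name df1 is chosen to match the answer below, and the column types (character codes, integer years) are an assumption based on the printed output in that answer.

```r
# Sample rows copied from the table above; the real data has about 1.3 million rows.
df1 <- data.frame(
  Product_Code   = c("AB1F", "AB1F (A Version)", "TG1F (B Version)",
                     "AB1Z (A Version)", "TG1F", "GX1T"),
  Publisher      = c("A", "A", "B", "A", "B", "C"),
  Published_Date = c(2011L, 1999L, 2001L, 2003L, 2006L, 2011L),
  stringsAsFactors = FALSE  # keep the codes as character, not factor
)

# Inspect the structure to confirm the column types.
str(df1)
```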

What I'm trying to do: among rows with the same Publisher, use grep() on Product_Code to find the rows that share the same product code regardless of version, and set them all to the oldest Published_Date.

So the result will look like this:

Product_Code      Publisher    Published_Date
AB1F                  A            1999
AB1F (A Version)      A            1999
TG1F (B Version)      B            2001
AB1Z (A Version)      A            2003
TG1F                  B            2001
GX1T                  C            2011

I tried

for (n in 1:nrow(df)) {
   A=which(grepl(df[n,1],df[,1])==TRUE & df[n,2]==df[,2])
   min.date=min(df[A,3])
   df[A,3]=min.date
}

I am not sure whether this for loop even works, because it never finishes running on my computer.

Any help will be appreciated!

Answer 1

We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT(df1). For the grouping variable, use sub to remove everything from the first whitespace onward (the space, the (, and the characters after it), which leaves the base product code. Within each group, if any 'Product_Code' contains a (, use sub to extract the character after the (, match 'A' and 'B' against those values, remove the NAs, use the resulting positions to subset 'Published_Date', and take the min; otherwise return 'Published_Date' unchanged. Assign (:=) the result back to 'Published_Date'.

library(data.table)
setDT(df1)[, Published_Date := if(any(grep("\\(", Product_Code))) 
  min(Published_Date[na.omit(match(c("A", "B"), sub(".*\\((.).*", "\\1", Product_Code)))])
   else Published_Date , by = .(grp=sub("\\s+.*", "", Product_Code))]
#       Product_Code Publisher Published_Date
#1:             AB1F         A           1999
#2: AB1F (A Version)         A           1999
#3: TG1F (B Version)         B           2001
#4: AB1Z (A Version)         A           2003
#5:             TG1F         B           2001
#6:             GX1T         C           2011
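To see what the two sub calls contribute, it can help to run them on their own. The following is a small base R sketch on the sample codes from the question; the names codes and version are only illustrative, while grp is the grouping name used in the answer.

```r
# Product codes copied from the question's sample data.
codes <- c("AB1F", "AB1F (A Version)", "TG1F (B Version)",
           "AB1Z (A Version)", "TG1F", "GX1T")

# Grouping key: drop everything from the first run of whitespace onward.
grp <- sub("\\s+.*", "", codes)

# Version letter: the single character right after "(".
# Codes without "(" do not match the pattern and come back unchanged.
version <- sub(".*\\((.).*", "\\1", codes)

# Show the three vectors side by side.
data.frame(codes, grp, version, stringsAsFactors = FALSE)
```

Because unversioned codes pass through the second sub unchanged, match(c("A", "B"), version) only finds the rows that actually carry an A or B version.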

Or with dplyr and tidyr: separate 'Product_Code' into two columns ("Product", "Version"), group by "Product", and mutate 'Published_Date' based on an if/else condition.

library(dplyr)
library(tidyr)
df1 %>% 
    separate(Product_Code, into = c("Product", "Version"), remove=FALSE) %>%
    group_by(Product) %>% 
    mutate(Published_Date = if(all(is.na(Version))) Published_Date
          else min(Published_Date[Version == Publisher & !is.na(Version)])) %>%
    ungroup() %>%   
    select(-Product, - Version)
#      Product_Code Publisher Published_Date
#             <chr>     <chr>          <int>
#1             AB1F         A           1999
#2 AB1F (A Version)         A           1999
#3 TG1F (B Version)         B           2001
#4 AB1Z (A Version)         A           2003
#5             TG1F         B           2001
#6             GX1T         C           2011

Instead of separate, we can also use extract, which avoids the warning message:

df1 %>% 
   extract(Product_Code, into = c("Product", "Version"), 
                     "(\\S+)\\s*\\(*(\\S*).*", remove = FALSE)%>%
   group_by(Product) %>%
   mutate(Published_Date = if(all(!nzchar(Version))) Published_Date
      else min(Published_Date[Version == Publisher])) %>%
   ungroup() %>%
   select(-Product, -Version)
#     Product_Code Publisher Published_Date
#             <chr>     <chr>          <int>
#1             AB1F         A           1999
#2 AB1F (A Version)         A           1999
#3 TG1F (B Version)         B           2001
#4 AB1Z (A Version)         A           2003
#5             TG1F         B           2001
#6             GX1T         C           2011
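The regular expression passed to extract has two capture groups. As a rough check of what they pick up, the same pattern can be run with base R's regexec and regmatches. This is only an illustration of the pattern, not of how tidyr works internally, and the names codes, pattern, parts and captured are made up for the sketch.

```r
# A few product codes from the question's sample data.
codes <- c("AB1F", "AB1F (A Version)", "TG1F (B Version)", "GX1T")

# Same pattern as in the extract() call above.
pattern <- "(\\S+)\\s*\\(*(\\S*).*"

# regexec() reports the full match plus one entry per capture group.
parts <- regmatches(codes, regexec(pattern, codes))

# Drop the full match and keep the two groups as columns.
captured <- do.call(rbind, lapply(parts, function(p) p[-1]))
colnames(captured) <- c("Product", "Version")
captured
```

Codes without a version give an empty string in the second column rather than NA, which is why the extract variant tests nzchar(Version) instead of is.na(Version).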

Update

If the codes do not follow a specific pattern, we can insert a ( into elements that do not have one and contain more than one word:

df1$Product_Code <- sub("\\s+\\(*", " (", df1$Product_Code)

and then use the code above.
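As a quick illustration of that normalization step, here is the same sub call applied to a few inputs. The third and fourth strings are hypothetical variants that do not appear in the question; they are included only to show how irregular spacing or a missing ( would be handled.

```r
# Hypothetical inputs: only the first two forms appear in the question's data.
codes <- c("AB1F", "AB1F (A Version)", "AB1F A Version", "TG1F  (B Version)")

# Replace the first run of whitespace, plus any "(" that follows it, with " (".
# Single-word codes contain no whitespace, so they are left alone.
normalized <- sub("\\s+\\(*", " (", codes)
normalized
# "AB1F"  "AB1F (A Version)"  "AB1F (A Version"  "TG1F (B Version)"
```

No closing ) is added to the third element, which should not matter here because the patterns above only look for the opening (.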

Answer 1 (3 votes, accepted) 2016-06-25 08:43:08