Faster way to run a for loop in R for this?
So, this is what my data frame looks like:
Product_Code Publisher Published_Date
AB1F A 2011
AB1F (A Version) A 1999
TG1F (B Version) B 2001
AB1Z (A Version) A 2003
TG1F B 2006
GX1T C 2011
with about 1.3 million rows.
What I'm trying to do is, for rows with the same Publisher, use grep() on Product_Code to find rows with the same product code regardless of which version they are, and set them all to the oldest Published_Date.
So the result will look like this:
Product_Code Publisher Published_Date
AB1F A 1999
AB1F (A Version) A 1999
TG1F (B Version) B 2001
AB1Z (A Version) A 2003
TG1F B 2001
GX1T C 2011
I tried:
for (n in 1:nrow(df)) {
  A = which(grepl(df[n,1], df[,1]) == TRUE & df[n,2] == df[,2])
  min.date = min(df[A,3])
  df[A,3] = min.date
}
I am not sure whether this for loop even works, because my computer never finishes running it.
Any help will be appreciated!
We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT(df1). Using sub, we strip the version suffix from 'Product_Code' (the whitespace and everything after it, such as " (A Version)") and use the result as the grouping variable. If any 'Product_Code' in the group contains a ( character, we match 'A', 'B' against the letter extracted from 'Product_Code', remove the NAs, use the result to subset 'Published_Date', and take the min of that; otherwise we return 'Published_Date' unchanged. The result is assigned (:=) to 'Published_Date'.
library(data.table)
# Group by the product code with any version suffix stripped
setDT(df1)[, Published_Date := if (any(grepl("\\(", Product_Code)))
      min(Published_Date[na.omit(match(c("A", "B"), sub(".*\\((.).*", "\\1", Product_Code)))])
    else Published_Date,
  by = .(grp = sub("\\s+.*", "", Product_Code))]
Product_Code Publisher Published_Date
#1: AB1F A 1999
#2: AB1F (A Version) A 1999
#3: TG1F (B Version) B 2001
#4: AB1Z (A Version) A 2003
#5: TG1F B 2001
#6: GX1T C 2011
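To make the two sub() calls in the data.table answer easier to follow, here is a minimal base R sketch that only computes the intermediate values. The data frame is rebuilt from the sample in the question, and the names grp and ver are illustrative helpers, not part of the original answer.

```r
# Sample data copied from the question
df1 <- data.frame(
  Product_Code = c("AB1F", "AB1F (A Version)", "TG1F (B Version)",
                   "AB1Z (A Version)", "TG1F", "GX1T"),
  Publisher = c("A", "A", "B", "A", "B", "C"),
  Published_Date = c(2011L, 1999L, 2001L, 2003L, 2006L, 2011L),
  stringsAsFactors = FALSE
)

# Grouping key: the first whitespace and everything after it is dropped
grp <- sub("\\s+.*", "", df1$Product_Code)

# Version letter: the single character right after "(".
# Codes without "(" do not match the pattern and are returned unchanged.
ver <- sub(".*\\((.).*", "\\1", df1$Product_Code)

print(data.frame(Product_Code = df1$Product_Code, grp = grp, ver = ver))
```

Rows whose ver is neither "A" nor "B" are ignored by match(c("A", "B"), ...), which is why the unversioned rows do not contribute their own date to the minimum.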
Or with dplyr and tidyr, we separate the 'Product_Code' into two columns ("Product", "Version"); grouped by "Product", we mutate the 'Published_Date' based on an if/else condition.
library(dplyr)
library(tidyr)
df1 %>%
  separate(Product_Code, into = c("Product", "Version"), remove = FALSE) %>%
  group_by(Product) %>%
  mutate(Published_Date = if (all(is.na(Version))) Published_Date
         else min(Published_Date[Version == Publisher & !is.na(Version)])) %>%
  ungroup() %>%
  select(-Product, -Version)
# Product_Code Publisher Published_Date
# <chr> <chr> <int>
#1 AB1F A 1999
#2 AB1F (A Version) A 1999
#3 TG1F (B Version) B 2001
#4 AB1Z (A Version) A 2003
#5 TG1F B 2001
#6 GX1T C 2011
Instead of separate, we can also use extract to avoid the warning message:
df1 %>%
  extract(Product_Code, into = c("Product", "Version"),
          "(\\S+)\\s*\\(*(\\S*).*", remove = FALSE) %>%
  group_by(Product) %>%
  mutate(Published_Date = if (all(!nzchar(Version))) Published_Date
         else min(Published_Date[Version == Publisher])) %>%
  ungroup() %>%
  select(-Product, -Version)
# Product_Code Publisher Published_Date
# <chr> <chr> <int>
#1 AB1F A 1999
#2 AB1F (A Version) A 1999
#3 TG1F (B Version) B 2001
#4 AB1Z (A Version) A 2003
#5 TG1F B 2001
#6 GX1T C 2011
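For comparison, the grouped-minimum idea can also be sketched in base R with ave(). This is a simplified variant, not the method used in the answers above: it groups by Publisher together with the stripped product code (as the question describes) and takes the plain minimum over every row in the group, without the check that the version letter equals the publisher. The names base_code and key are illustrative.

```r
# Sample data copied from the question
df1 <- data.frame(
  Product_Code = c("AB1F", "AB1F (A Version)", "TG1F (B Version)",
                   "AB1Z (A Version)", "TG1F", "GX1T"),
  Publisher = c("A", "A", "B", "A", "B", "C"),
  Published_Date = c(2011L, 1999L, 2001L, 2003L, 2006L, 2011L),
  stringsAsFactors = FALSE
)

# Base product code: drop the first whitespace and everything after it
base_code <- sub("\\s+.*", "", df1$Product_Code)

# One key per (Publisher, base code) pair; ave() applies min() within each key
key <- paste(df1$Publisher, base_code, sep = "|")
df1$Published_Date <- ave(df1$Published_Date, key, FUN = min)

print(df1)
```

Because sub() and ave() are vectorised, this avoids the row-by-row grepl() calls of the original loop. On the sample data it yields the same dates as the outputs shown above, but it may differ on real data where a group mixes version letters from different publishers.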
If there are no specific patterns, we can create a ( for elements that don't have a ( and have more than one word:
df1$Product_Code <- sub("\\s+\\(*", " (", df1$Product_Code)
and then use the code above.
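A small sketch of what that substitution does to individual codes. The input "AB1F A Version" (no parenthesis) is a made-up example for illustration and does not appear in the question's data.

```r
# Three cases: no suffix, suffix without "(", suffix already with "("
codes <- c("AB1F", "AB1F A Version", "AB1F (A Version)")

# Replace the first run of whitespace (plus any "(" that follows it) with " ("
normalised <- sub("\\s+\\(*", " (", codes)

print(normalised)
```

Single-word codes contain no whitespace and are left alone, codes that already have a ( are unchanged, and multi-word codes without one gain a ( before the second word. The closing parenthesis is not added, which does not matter for the extract() pattern above because it only captures the non-space characters following the (.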