So, this is what my dataframe looks like:
Product_Code Publisher Published_Date
AB1F A 2011
AB1F (A Version) A 1999
TG1F (B Version) B 2001
AB1Z (A Version) A 2003
TG1F B 2006
GX1T C 2011
with about 1.3 million rows.
What I'm trying to do: for rows with the same Publisher, use grep() on Product_Code to find rows that share the same product code regardless of version, and set them all to the oldest Published_Date.
So the result will look like this:
Product_Code Publisher Published_Date
AB1F A 1999
AB1F (A Version) A 1999
TG1F (B Version) B 2001
AB1Z (A Version) A 2003
TG1F B 2001
GX1T C 2011
I tried
for (n in 1:nrow(df)) {
A=which(grepl(df[n,1],df[,1])==TRUE & df[n,2]==df[,2])
min.date=min(df[A,3])
df[A,3]=min.date
}
I am not sure if this for loop code even works because my computer will never finish running the code.
Any help will be appreciated!
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)). We strip everything from the first whitespace onward in 'Product_Code' using sub and use the result as the grouping variable. If any 'Product_Code' in the group contains a ( character, we match 'A' and 'B' against the version letter extracted from 'Product_Code', remove the NAs, use the resulting positions to subset 'Published_Date', and take the min of that; otherwise we return 'Published_Date' unchanged. The result is assigned (:=) to 'Published_Date'.
library(data.table)
setDT(df1)[, Published_Date := if(any(grep("\\(", Product_Code)))
min(Published_Date[na.omit(match(c("A", "B"), sub(".*\\((.).*", "\\1", Product_Code)))])
else Published_Date , by = .(grp=sub("\\s+.*", "", Product_Code))]
Product_Code Publisher Published_Date
#1: AB1F A 1999
#2: AB1F (A Version) A 1999
#3: TG1F (B Version) B 2001
#4: AB1Z (A Version) A 2003
#5: TG1F B 2001
#6: GX1T C 2011
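For comparison, here is a minimal base R sketch of the grouping idea as the question states it (same Publisher, same product code ignoring the version suffix). It is not the data.table code above: it assumes the version suffix always starts with " (", and the names df, base_code and key are made up for illustration.

```r
# Sample data copied from the question.
df <- data.frame(
  Product_Code = c("AB1F", "AB1F (A Version)", "TG1F (B Version)",
                   "AB1Z (A Version)", "TG1F", "GX1T"),
  Publisher = c("A", "A", "B", "A", "B", "C"),
  Published_Date = c(2011L, 1999L, 2001L, 2003L, 2006L, 2011L),
  stringsAsFactors = FALSE
)

# Base product code: drop an optional " (...)" suffix.
base_code <- sub("\\s*\\(.*$", "", df$Product_Code)

# One key per Publisher / base-code pair (assumes "|" does not occur in either).
key <- paste(df$Publisher, base_code, sep = "|")

# ave() computes min() per key and returns it aligned to the original rows.
df$Published_Date <- ave(df$Published_Date, key, FUN = min)

print(df)
```

Because sub() and ave() are vectorised, this avoids the row-by-row grepl() loop from the question, which rescans the whole column on every iteration.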
Or with dplyr (plus tidyr), we separate the 'Product_Code' into two columns ("Product", "Version"); then, grouped by "Product", we mutate the 'Published_Date' based on an if/else condition.
library(dplyr)
library(tidyr)
df1 %>%
separate(Product_Code, into = c("Product", "Version"), remove=FALSE) %>%
group_by(Product) %>%
mutate(Published_Date = if(all(is.na(Version))) Published_Date
else min(Published_Date[Version == Publisher & !is.na(Version)])) %>%
ungroup() %>%
select(-Product, - Version)
# Product_Code Publisher Published_Date
# <chr> <chr> <int>
#1 AB1F A 1999
#2 AB1F (A Version) A 1999
#3 TG1F (B Version) B 2001
#4 AB1Z (A Version) A 2003
#5 TG1F B 2001
#6 GX1T C 2011
Instead of separate, we can also use extract to avoid the warning message:
df1 %>%
extract(Product_Code, into = c("Product", "Version"),
"(\\S+)\\s*\\(*(\\S*).*", remove = FALSE)%>%
group_by(Product) %>%
mutate(Published_Date = if(all(!nzchar(Version))) Published_Date
else min(Published_Date[Version == Publisher])) %>%
ungroup() %>%
select(-Product, -Version)
# Product_Code Publisher Published_Date
# <chr> <chr> <int>
#1 AB1F A 1999
#2 AB1F (A Version) A 1999
#3 TG1F (B Version) B 2001
#4 AB1Z (A Version) A 2003
#5 TG1F B 2001
#6 GX1T C 2011
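To see what the two capture groups in the extract pattern pick up, the sketch below applies the same regular expression with base R's regexec(). It is only an illustration: perl = TRUE is an assumption made here for predictable greedy matching, and codes, parts, product and version are made-up names.

```r
codes <- c("AB1F", "AB1F (A Version)", "TG1F (B Version)")
pattern <- "(\\S+)\\s*\\(*(\\S*).*"

# regexec() returns the full match plus one entry per capture group.
parts <- regmatches(codes, regexec(pattern, codes, perl = TRUE))

product <- vapply(parts, `[`, character(1), 2)  # first capture group
version <- vapply(parts, `[`, character(1), 3)  # second capture group

print(data.frame(codes, product, version))
```

Codes without a version yield an empty string rather than NA in the second group, which is why the extract variant tests nzchar(Version) instead of is.na(Version).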
If there is no specific pattern, we can insert a ( for elements that lack one and have more than one word:
df1$Product_Code <- sub("\\s+\\(*", " (", df1$Product_Code)
and then apply the code above.
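A quick check of what that sub() call does, on three illustrative strings (the third, "AB1F A Version", is a hypothetical code without parentheses, not taken from the question's data):

```r
codes <- c("AB1F", "AB1F (A Version)", "AB1F A Version")

# Replace the first run of whitespace (plus any "(" right after it) with " (".
normalised <- sub("\\s+\\(*", " (", codes)

print(normalised)
```

Single-word codes and codes that already have a ( are left unchanged, while the hypothetical third code gains an opening ( only. No closing parenthesis is added, but the extract pattern does not rely on one.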