Get row index to update another dataframe in loop
I have the following data:
EDIT:
df<- data.frame(
id = c(432, 324, 322, 331, 242,443,223 ),
desc1= c("metal","steels&iron","irons\\copper", "sports material", "leather material", "durable goods", "electronic store")
,
store_names = c("ik bros","steel idrs", "kb materials", "ca pty (ltd)", "bkk stores", "k/k \\shop", "h/j & jj")
,
class = c("", "unknown","", "sports", "unknown", "unknown", "")
)
I want to search desc1 and desc2 for keywords and assign a string value to the class column. For example, the keywords could be:
indus_1 <- c("iron", "steel")
goods_store_1 <- c("goods", "store", "stores")
electr_1 <- c("electronic", "chips", "semiconductor")
unlabelled_1 <- c("leather")
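Here the variable names indus_1, sports_1, etc. are used to derive the string value assigned to class. For example, if the keyword "metal" is found, I strip the "_1" suffix and assign indus to class. In my current approach I look up the indices of the rows where a keyword is found and copy them into a copy of the same data frame, but this takes a long time on larger datasets, and it can miss several classes because I use \\b to find exact matches. This is the expected output: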
id   desc1             store_names   class
432  metal             ik bros
324  steels&iron       steel idrs    indus
322  irons\\copper     kb materials  indus
331  sports material   ca pty (ltd)  sports
242  leather material  bkk stores    unlabelled
443  durable goods     k/k \\shop    goods_store
223  electronic store  h/j & jj      electr
I am looking for a more efficient way to do the same thing; a complete dplyr version would be preferable. Thanks for your suggestions.
I'm not sure I have interpreted your question correctly; is this what you want to do?
library(tidyverse)
df<- data.frame(
id = c(432, 324, 322, 331, 242 ),
desc1 = c("iron and metal","sports material", "leather material", "durable goods", "electronic goods")
,
desc2 = c("ik bros", "ca pty (ltd)", "bkk stores", "k/k \\shop", "h/j & jj")
,
class = c("", "sports", "unknown", "unknown", "")
)
df2 <- df %>%
  mutate(class = case_when(
    str_detect(desc1, "metal") | str_detect(desc2, "metal") ~ "indus",
    str_detect(desc1, "sports") | str_detect(desc2, "sports") ~ "sports",
    str_detect(desc1, "electronic") | str_detect(desc2, "electronic") ~ "electr",
    str_detect(desc1, "goods") | str_detect(desc2, "goods") ~ "goods_store",
    str_detect(desc1, "leather") | str_detect(desc2, "leather") ~ "unlabelled"
  ))
df2
#>    id            desc1        desc2       class
#> 1 432   iron and metal      ik bros       indus
#> 2 324  sports material ca pty (ltd)      sports
#> 3 322 leather material   bkk stores  unlabelled
#> 4 331    durable goods   k/k \\shop goods_store
#> 5 242 electronic goods     h/j & jj      electr
Created on 2021-10-25 by the reprex package (v2.0.1)
In that case, you can do this:
vars_1 <- mget(ls(pattern = '_1'))
vars_1 <- vars_1[!grepl('vars', names(vars_1))]
pat <- sub("_1", "", names(vars_1))
names(pat) <- sprintf(".*(%s).*", unlist(vars_1))
df %>%
mutate(class = str_replace_all(invoke(str_c, across(starts_with('desc'))), pat))
   id            desc1        desc2       class
1 432   iron and metal      ik bros       indus
2 324  sports material ca pty (ltd)      sports
3 322 leather material   bkk stores  unlabelled
4 331    durable goods   k/k \\shop goods_store
5 242 electronic goods     h/j & jj      electr
Logically, my answer is similar to @Onyambu's answer, with a few tweaks.
library(tidyverse)
mget(ls(pattern = '_1')) %>%
stack() %>%
group_by(ind = sub('_1', '', ind)) %>%
summarise(values = sprintf('.*\\b(%s)\\b.*', paste0(values, collapse = '|'))) %>%
select(2, 1) %>%
deframe() -> pat
pat
#.*\\b(electronic|chips|semiconductor)\\b.* .*\\b(goods|store|stores)\\b.*
# "electr" "goods_store"
# .*\\b(iron|steel)\\b.* .*\\b(leather)\\b.*
# "indus" "unlabelled"
df %>%
mutate(class2 = str_replace_all(desc1, pat),
class2 = ifelse(desc1 == class2, '', class2))
#   id            desc1  store_names   class      class2
#1 432            metal      ik bros
#2 324      steels&iron   steel idrs unknown       indus
#3 322    irons\\copper kb materials
#4 331  sports material ca pty (ltd)  sports
#5 242 leather material   bkk stores unknown  unlabelled
#6 443    durable goods   k/k \\shop unknown goods_store
#7 223 electronic store     h/j & jj              electr
For id = 322 there is no match with indus because we are looking for exact matches: indus_1 contains iron, while the desc1 column contains irons.
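To see the effect of the word boundary on a row such as id = 322, here is a minimal base R sketch. It is not taken from any of the answers: the helper `classify()` and its `exact` argument are hypothetical names introduced for illustration, and the optional trailing "s" is only one possible way to relax the match. The keyword list is ordered alphabetically (as `ls()` would return the variables) and the first matching label wins.

# Keyword vectors from the question; the "_1" suffix is stripped to get the label.
keywords <- list(
  electr_1      = c("electronic", "chips", "semiconductor"),
  goods_store_1 = c("goods", "store", "stores"),
  indus_1       = c("iron", "steel"),
  unlabelled_1  = c("leather")
)

desc1 <- c("metal", "steels&iron", "irons\\copper", "sports material",
           "leather material", "durable goods", "electronic store")

# Hypothetical helper: returns the first label whose keywords match, else "".
classify <- function(x, keywords, exact = TRUE) {
  out <- rep("", length(x))
  for (nm in names(keywords)) {
    alt <- paste(keywords[[nm]], collapse = "|")
    # exact = TRUE: whole-word match only.
    # exact = FALSE: assumption, also accept an optional plural "s".
    pattern <- if (exact) sprintf("\\b(%s)\\b", alt) else sprintf("\\b(%s)s?\\b", alt)
    hit <- out == "" & grepl(pattern, x, perl = TRUE)
    out[hit] <- sub("_1$", "", nm)
  }
  out
}

exact_class   <- classify(desc1, keywords, exact = TRUE)
relaxed_class <- classify(desc1, keywords, exact = FALSE)
data.frame(desc1, exact_class, relaxed_class)

With `exact = TRUE` the third row stays empty, as in the output above; with `exact = FALSE` it is labelled indus. Whether a plural should count as a match depends on the data, so treat the relaxed pattern as a starting point rather than a rule.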