
R function that detects if a dataframe column contains string values from another dataframe column and adds a column that contains the detected str

I have two dataframes:

df1:

name
Apple page
Mango page
Lychee juice
Cranberry club

df2:

fruit
Apple
Grapes
Strawberry
Mango
lychee
cranberry

If df1$name contains a value from df2$fruit (ignoring case), I want to add a column to df1 holding the df2$fruit value that df1$name contains. df1 would then look like this:

name            category
Apple page      Apple
Mango page      Mango
Lychee juice    lychee
Cranberry club  cranberry

This should work:

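library(stringr)

# Build one alternation pattern ("Apple|Grapes|...") from df2$fruit and
# extract the first case-insensitive match found in each df1$name.
df1$category = str_extract(
  df1$name,
  pattern = regex(paste(df2$fruit, collapse = "|"), ignore_case = TRUE)
)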

df1
#             name  category
# 1     Apple page     Apple
# 2     Mango page     Mango
# 3   Lychee juice    Lychee
# 4 Cranberry club Cranberry

Using this data:

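# sep = ";" never occurs in the text, so each line is read as a single
# column and names containing spaces are not split.
df1 = read.table(text = 'name
Apple page
Mango page
Lychee juice
Cranberry club', header = TRUE, sep = ";")

df2 = read.table(text = 'fruit
Apple
Grapes
Strawberry
Mango
lychee
cranberry', header = TRUE, sep = ";")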
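
One detail worth noting about the str_extract() approach: it returns the text as it is written in df1$name ("Lychee", "Cranberry"), whereas the expected output in the question uses the spelling from df2$fruit ("lychee", "cranberry"). A minimal sketch of one way to map each match back to df2$fruit follows. It assumes the fruit names are unique when case is ignored and contain no regex metacharacters. The variable `found` and the extra "Banana bread" row are illustrative additions, not part of the original data.

```r
library(stringr)

df1 <- data.frame(name = c("Apple page", "Mango page", "Lychee juice",
                           "Cranberry club",
                           "Banana bread"))  # hypothetical row with no match
df2 <- data.frame(fruit = c("Apple", "Grapes", "Strawberry",
                            "Mango", "lychee", "cranberry"))

# Step 1: extract the matched text exactly as it appears in df1$name.
found <- str_extract(
  df1$name,
  regex(paste(df2$fruit, collapse = "|"), ignore_case = TRUE)
)

# Step 2: look the match up in df2$fruit, comparing in lower case,
# so that the category uses df2's own spelling. Unmatched names give NA.
df1$category <- df2$fruit[match(tolower(found), tolower(df2$fruit))]

df1
```

Names that contain no fruit at all end up with NA in the category column rather than being dropped.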

First, you could add a column to the dataframe of names for each of the possible categories, as placeholders (just filled with NA). Then, for each of those columns, check whether the column name (that is, the category) appears in the name. Pivot the result into a long dataframe, and then remove the FALSE rows, i.e. those where the category was not detected in the name.

library(tidyverse)

df1 <- tribble(
  ~name,
  "Apple page",
  "Mango page",
  "Lychee juice",
  "Cranberry club"
)
df2 <- tribble(
  ~fruit,
  "Apple",
  "Grapes",
  "Strawberry",
  "Mango",
  "lychee",
  "cranberry"
)

fruits <- df2$fruit %>%
  str_to_lower() %>% 
  set_names(rep(NA_character_, length(.)), .)

df1 %>% 
  add_column(!!!fruits) %>% 
  mutate(across(-name, ~str_detect(str_to_lower(name), cur_column()))) %>% 
  pivot_longer(-name, names_to = "category") %>% 
  filter(value) %>% 
  select(-value)

#> # A tibble: 4 × 2
#>   name           category 
#>   <chr>          <chr>    
#> 1 Apple page     apple    
#> 2 Mango page     mango    
#> 3 Lychee juice   lychee   
#> 4 Cranberry club cranberry
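
The same detect-then-filter idea can also be sketched without placeholder columns, by building every (name, fruit) pair and keeping the pairs that match. This is an alternative sketch rather than the method of the answer above. It assumes tidyr::crossing() to form the pairs and stringr's fixed() for literal matching; the names `matches` and `result` are illustrative.

```r
library(dplyr)
library(tidyr)
library(stringr)

df1 <- tibble(name = c("Apple page", "Mango page",
                       "Lychee juice", "Cranberry club"))
df2 <- tibble(fruit = c("Apple", "Grapes", "Strawberry",
                        "Mango", "lychee", "cranberry"))

# All (name, fruit) combinations, keeping only the pairs where the fruit
# occurs in the name. Both sides are lower-cased and compared literally.
matches <- crossing(df1, df2) %>%
  filter(str_detect(str_to_lower(name), fixed(str_to_lower(fruit)))) %>%
  rename(category = fruit)

# Join back onto df1 so the original row order is kept and names
# without any match get NA instead of disappearing.
result <- left_join(df1, matches, by = "name")

result
```

Unlike the pivoting version, this keeps the spelling from df2$fruit in the category column. A name that contains several fruits would appear once per matching fruit.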
