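R function to detect whether a data frame column contains string values from another data frame's column, and add a column with the detected string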

Question

I have two data frames:

df1:

name
Apple page
Mango page
Lychee juice
Cranberry club

df2:

fruit
Apple
Grapes
Strawberry
Mango
lychee
cranberry

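If df1$name contains a value from df2$fruit (case-insensitively), I want to add a column to df1 that holds the df2$fruit value found in df1$name. df1 would then look like this:

name	category
Apple page	Apple
Mango page	Mango
Lychee juice	Lychee
Cranberry club	Cranberry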

Answer 1

This should work:

library(stringr)
df1$category = str_extract(
  df1$name, 
  pattern = regex(paste(df2$fruit, collapse = "|"), ignore_case = TRUE)
)

df1
#             name  category
# 1     Apple page     Apple
# 2     Mango page     Mango
# 3   Lychee juice    Lychee
# 4 Cranberry club Cranberry

Using this data:

# sep = ";" keeps each line as a single value even though it contains spaces
df1 = read.table(text = 'name
Apple page
Mango page
Lychee juice
Cranberry club', header = TRUE, sep = ";")

df2 = read.table(text = 'fruit
Apple
Grapes
Strawberry
Mango
lychee
cranberry', header = TRUE, sep = ";")
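
If adding a stringr dependency is not an option, roughly the same result can be obtained with base R. The following is a minimal sketch and not part of the original answer; the extra row "Plain water" is a hypothetical value added only to show what happens when nothing matches.

```r
# Same sample data as above, plus one hypothetical row with no fruit in it.
df1 <- data.frame(
  name = c("Apple page", "Mango page", "Lychee juice", "Cranberry club", "Plain water"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  fruit = c("Apple", "Grapes", "Strawberry", "Mango", "lychee", "cranberry"),
  stringsAsFactors = FALSE
)

# Build one alternation pattern: "Apple|Grapes|...|cranberry".
pattern <- paste(df2$fruit, collapse = "|")

# regexpr() gives the position of the first match per element, or -1 if none.
m <- regexpr(pattern, df1$name, ignore.case = TRUE)

# regmatches() returns only the matched substrings, so unmatched rows stay NA.
df1$category <- NA_character_
df1$category[m > 0] <- regmatches(df1$name, m)

df1
```

Like the str_extract() version, this returns the text as it is spelled in df1$name and only the first match per row. Fruit values containing regex metacharacters would need escaping before being pasted into the pattern.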

Answer 2

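First, you can create one column in the data frame for each possible category, using the category names as placeholders (filled only with NA). Then, for each of those columns, check whether the column name (i.e. the category) appears in name. Pivot the result into a long data frame and drop the FALSE rows, that is, the rows where the category was not detected in the name.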

library(tidyverse)

df1 <- tribble(
  ~name,
  "Apple page",
  "Mango page",
  "Lychee juice",
  "Cranberry club"
)
df2 <- tribble(
  ~fruit,
  "Apple",
  "Grapes",
  "Strawberry",
  "Mango",
  "lychee",
  "cranberry"
)

fruits <- df2$fruit %>%
  str_to_lower() %>% 
  set_names(rep(NA_character_, length(.)), .)

df1 %>% 
  add_column(!!!fruits) %>% 
  mutate(across(-name, ~str_detect(str_to_lower(name), cur_column()))) %>% 
  pivot_longer(-name, names_to = "category") %>% 
  filter(value) %>% 
  select(-value)

#> # A tibble: 4 × 2
#>   name           category 
#>   <chr>          <chr>    
#> 1 Apple page     apple    
#> 2 Mango page     mango    
#> 3 Lychee juice   lychee   
#> 4 Cranberry club cranberry
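
The same idea of testing each category against each name can also be written without the placeholder columns. The sketch below is an alternative in base R, not the answerer's code, and the helper name find_fruits is hypothetical. It compares lowercased text literally, so it reports the spelling used in df2 and keeps every fruit found when a name contains more than one.

```r
df1 <- data.frame(
  name = c("Apple page", "Mango page", "Lychee juice", "Cranberry club"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  fruit = c("Apple", "Grapes", "Strawberry", "Mango", "lychee", "cranberry"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return every fruit that occurs in `name`,
# ignoring case and treating the fruit as literal text (no regex).
find_fruits <- function(name, fruits) {
  hits <- vapply(
    fruits,
    function(f) grepl(tolower(f), tolower(name), fixed = TRUE),
    logical(1)
  )
  fruits[hits]
}

matches <- lapply(df1$name, find_fruits, fruits = df2$fruit)

# Collapse multiple hits into one string; use NA when nothing was found.
df1$category <- vapply(
  matches,
  function(m) if (length(m) > 0) paste(m, collapse = ", ") else NA_character_,
  character(1)
)

df1
```

Because the match is literal, fruit values containing characters such as "." or "(" need no escaping here, at the cost of one grepl() call per name and fruit pair.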

2 answers

Answer 1
2 2022-05-20 02:03:24

Answer 2
0 Accepted 2022-05-20 02:21:42
