Find all string matches from another data frame in R

Question

I am relatively new to R.

I have a data frame locs that has one variable, V1, and looks like this:

V1
edmonton general hospital
cardiovascular institute, hospital san carlos, madrid spain
hospital of santa maria, lisbon, portugal

and another data frame cities that has two variables and looks like this:

city              country
edmonton          canada
san carlos        spain
los angeles       united states
santa maria       united states
tokyo             japan
madrid            spain
santa maria       portugal
lisbon            portugal

I want to create two new variables in locs that list every city from cities found in V1, along with the corresponding countries, so that locs looks like this:

V1                                            city                  country                      
edmonton general hospital                     edmonton              canada
hospital san carlos, madrid spain             san carlos, madrid    spain
hospital of santa maria, lisbon, portugal     santa maria, lisbon   portugal, united states

A few things to note: V1 may match multiple countries. Also, if a country repeats (for instance, both san carlos and madrid are in spain), then I only want one instance of the country.

Please advise.

Thanks.

Answer 1

A solution using tidyverse and stringr. locs2 is the final output.

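library(tidyverse)
library(stringr)

locs2 <- locs %>%
  rowwise() %>%
  # For each row, try every distinct city name; non-matches come back as NA
  mutate(city = list(str_extract(V1, unique(cities$city)))) %>%
  ungroup() %>%
  unnest(city) %>%
  drop_na(city) %>%
  # Attach every country that has a city of that name
  left_join(cities, by = "city") %>%
  group_by(V1) %>%
  # Collapse to one row per V1, keeping each city and country once
  summarise(across(everything(), ~ toString(sort(unique(.x)))))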

Result

locs2 %>% as.data.frame()
                                                           V1                city                 country
1 cardiovascular institute, hospital san carlos, madrid spain  madrid, san carlos                   spain
2                                   edmonton general hospital            edmonton                  canada
3                   hospital of santa maria, lisbon, portugal lisbon, santa maria portugal, united states
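To see what the matching step does on its own, here is a minimal sketch for a single string. The names city_names, one_loc, hits and found are made up for illustration, and city_names is only a hand-picked subset of cities$city.

```r
library(stringr)

# Hypothetical subset of cities$city, for illustration only
city_names <- c("edmonton", "san carlos", "madrid", "lisbon")
one_loc <- "cardiovascular institute, hospital san carlos, madrid spain"

# str_extract() is vectorised over the pattern: one result per city name,
# either the matched text or NA when that city does not occur in the string
hits <- str_extract(one_loc, city_names)

# Dropping the NAs leaves only the cities that occur in the string
found <- hits[!is.na(hits)]
print(found)
```

In the full pipeline, rowwise() applies this to each row of locs, and drop_na(city) plays the role of the NA filter.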

DATA

library(tidyverse)

# tibble() replaces the deprecated data_frame()
locs <- tibble(V1 = c("edmonton general hospital",
               "cardiovascular institute, hospital san carlos, madrid spain",
               "hospital of santa maria, lisbon, portugal"))

cities <- read.table(text = "city              country
edmonton          canada
'san carlos'        spain
'los angeles'       'united states'
'santa maria'       'united states'
tokyo             japan
madrid            spain
'santa maria'       portugal
lisbon            portugal",
                     header = TRUE, stringsAsFactors = FALSE)
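
A note on the matching itself: the city names are used as plain regular expressions, so a short name could also match inside a longer word. That does not happen in the sample data, but if it matters for yours, one possible variation is to wrap each name in word boundaries. The sketch below assumes the city names contain no regex metacharacters; find_cities is a hypothetical helper, not part of the answer above.

```r
library(stringr)

# Hypothetical helper: return the city names that occur in `text` as whole words.
# Assumes the names contain no regex metacharacters (., (, +, and so on).
find_cities <- function(text, city_names) {
  patterns <- str_c("\\b", city_names, "\\b")
  city_names[str_detect(text, patterns)]
}

# "lis" is a made-up name: it is a substring of "lisbon" but not a whole word
matched <- find_cities("hospital of santa maria, lisbon, portugal",
                       c("santa maria", "lisbon", "lis"))
print(matched)
```

Here "lis" is rejected even though it appears inside "lisbon", while the two real city names are kept.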

Answer 1 (accepted), 2017-10-06 21:23:58
