Partial matching of strings between two columns of a data frame

Question

I have a dataframe with Name1 (10 observations) and Name2 (3 observations). I have the following toy example:

   Name1                            Name2
Acadian Hospitals                 Wellington
Bridgewater Trust Associates      Zeus
Concordia Consulting              Acadian
Wellington Corporation LLC          .
Wellington Wealth Management        .
Prime Acadian Charity               .

If a value in Name2 matches part of a string in Name1, I want the output in column3 to be TRUE for that row. Currently, my code only works the other way around, using pmatch.

My final output should look like this:

   Name1                            Name2           Is_Matched
Acadian Hospitals                 Wellington           TRUE
Bridgewater Trust Associates      Zeus                 FALSE
Concordia Consulting              Acadian              FALSE
Wellington Corporation LLC          .                  TRUE
Wellington Wealth Management        .                  TRUE
Prime Acadian Charity               .                  TRUE

Answer 1

It sounds like Name2 is really just a set of lookup values. In that case you could build a single pattern by pasting all the non-missing values together, separated by "|", and then run one grepl search over all of df$Name1:

df$Is_Matched <- grepl(paste(df$Name2[df$Name2 != "."], collapse = "|"), df$Name1)
#                         Name1      Name2 Is_Matched
#1            Acadian Hospitals Wellington       TRUE
#2 Bridgewater Trust Associates       Zeus      FALSE
#3         Concordia Consulting    Acadian      FALSE
#4   Wellington Corporation LLC          .       TRUE
#5 Wellington Wealth Management          .       TRUE
#6        Prime Acadian Charity          .       TRUE

Note that this assumes missing values in Name2 are coded as "." rather than NA. It would be easy enough to adapt to any other coding of missing values.
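As a minimal sketch of that adaptation, assuming the missing entries in Name2 may be coded as NA, as ".", or as a mix of both. The helper name build_pattern and its missing_codes argument are hypothetical and not part of the original answer:

```r
# Toy data from the question; here Name2 uses NA for missing entries (an assumption).
df <- data.frame(
  Name1 = c("Acadian Hospitals", "Bridgewater Trust Associates",
            "Concordia Consulting", "Wellington Corporation LLC",
            "Wellington Wealth Management", "Prime Acadian Charity"),
  Name2 = c("Wellington", "Zeus", "Acadian", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop NA and any placeholder codes, then join with "|".
build_pattern <- function(x, missing_codes = ".") {
  keep <- !is.na(x) & !(x %in% missing_codes)
  paste(x[keep], collapse = "|")
}

pattern <- build_pattern(df$Name2)

# One regex search over Name1 using the combined lookup pattern.
df$Is_Matched <- grepl(pattern, df$Name1)
```

If every entry in Name2 is missing, the pattern becomes the empty string, which grepl matches in every non-missing row, so that case may be worth checking before searching.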

Answer 2

You could use sapply. Without an example to test on, I think something like this should work. I'll check it on an example in a sec.

df$Is_Matched <- sapply(df$Name2, function(x) any(grepl(x, df$Name1)))

EDIT:

Creating an example dataframe helped. sapply was returning a matrix in which each word in Name2 has its own column. So, you can test whether any row contains a TRUE using rowSums (TRUE = 1, FALSE = 0). Let me know if you have any issues with it.

> df <- data.frame(
+   Name1 = c("Acadian Hospitals", "Bridgewater Trust Associates",
+             "Concordia Consulting", "Wellington Corporation LLC",
+             "Wellington Wealth Management", "Prime Acadian Charity"),
+   Name2 = c("Wellington", "Zeus", "Acadian", NA, NA, NA),
+   stringsAsFactors = FALSE
+ )
> 
> match_me <- na.omit(df$Name2)
> df$Is_Matched <- rowSums(sapply(match_me, function(x) grepl(x, df$Name1))) > 0
> df
                         Name1      Name2 Is_Matched
1            Acadian Hospitals Wellington       TRUE
2 Bridgewater Trust Associates       Zeus      FALSE
3         Concordia Consulting    Acadian      FALSE
4   Wellington Corporation LLC       <NA>       TRUE
5 Wellington Wealth Management       <NA>       TRUE
6        Prime Acadian Charity       <NA>       TRUE
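To make the intermediate step concrete, the sketch below inspects the matrix that sapply returns before rowSums collapses it. The names lookup, hits and is_matched are illustrative, and fixed = TRUE is an added assumption that the lookup values should be treated as literal text rather than as regular expressions:

```r
Name1 <- c("Acadian Hospitals", "Bridgewater Trust Associates",
           "Concordia Consulting", "Wellington Corporation LLC",
           "Wellington Wealth Management", "Prime Acadian Charity")
lookup <- c("Wellington", "Zeus", "Acadian")

# One logical column per lookup value, one row per Name1 entry.
hits <- sapply(lookup, function(x) grepl(x, Name1, fixed = TRUE))

dim(hits)       # rows = length(Name1), columns = length(lookup)
colnames(hits)  # sapply names the columns after the lookup values

# A row counts as matched if at least one of its columns is TRUE.
is_matched <- rowSums(hits) > 0
```

Keeping the matrix around can also be useful for seeing which lookup value produced each match, not only whether there was one.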

Answer 3

With assistance from Mike H.:

Name1 = c("Bridgewater Trust Associates", "Acadian Wealth Management", "Wellington Wealth Trust", "Concordia University", "Southern Zeus College", "Parametric Modeling", "Wellington City Corporation", "Hotel Zanzibar") 
Name2 = c("Acadian", "Wellington", "Zeus")

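# Build the pattern before padding, so the NA padding is not pasted in as the literal string "NA"
pattern <- paste(Name2, collapse = "|")

# Pad the shorter vector with NA so both columns have the same length
max.len = max(length(Name1), length(Name2))
Name1 = c(Name1, rep(NA, max.len - length(Name1)))
Name2 = c(Name2, rep(NA, max.len - length(Name2)))
column3 <- grepl(pattern, Name1)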

df <- data.frame(Name1, Name2, column3, stringsAsFactors = FALSE)

3 answers

Answer 1: score 4, accepted, 2019-03-02 01:18:42
Answer 2: score 2, 2019-03-02 01:10:19
Answer 3: score 2, 2019-03-02 02:46:28
