Partial match of two columns by string in dataframe
I have a dataframe with Name1 (10 observations) and Name2 (3 observations). I have the following toy example:
Name1                         Name2
Acadian Hospitals             Wellington
Bridgewater Trust Associates  Zeus
Concordia Consulting          Acadian
Wellington Corporation LLC    .
Wellington Wealth Management  .
Prime Acadian Charity         .
If part of a Name1 string matches any value in Name2, I want the output in column3 to be TRUE. Currently, my code only works the other way around, using pmatch.
My final output should look like this:
Name1                         Name2       Is_Matched
Acadian Hospitals             Wellington  TRUE
Bridgewater Trust Associates  Zeus        FALSE
Concordia Consulting          Acadian     FALSE
Wellington Corporation LLC    .           TRUE
Wellington Wealth Management  .           TRUE
Prime Acadian Charity         .           TRUE
It sounds like Name2 is really just a set of lookup values. In that case you could build a lookup pattern by pasting all the non-missing values of df$Name2 together with "|", and then do one simple grepl search on all of df$Name1:
# Drop the "." placeholders, join the remaining values with "|", and search Name1
df$Is_Matched <- grepl(paste(df$Name2[df$Name2 != "."], collapse = "|"), df$Name1)
#                          Name1      Name2 Is_Matched
# 1            Acadian Hospitals Wellington       TRUE
# 2 Bridgewater Trust Associates       Zeus      FALSE
# 3         Concordia Consulting    Acadian      FALSE
# 4   Wellington Corporation LLC          .       TRUE
# 5 Wellington Wealth Management          .       TRUE
# 6        Prime Acadian Charity          .       TRUE
Note this assumes that missing values in Name2 are coded as "." rather than NA. It would be easy enough to change to any other coding of missing values.
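For instance, if the missing values were coded as NA instead, the same idea might look like the sketch below. The data frame is rebuilt from the question's toy example, and the names lookup and pattern are illustrative only, not part of the original answer.

```r
# Toy data from the question, with missing Name2 values coded as NA (assumption)
df <- data.frame(
  Name1 = c("Acadian Hospitals", "Bridgewater Trust Associates",
            "Concordia Consulting", "Wellington Corporation LLC",
            "Wellington Wealth Management", "Prime Acadian Charity"),
  Name2 = c("Wellington", "Zeus", "Acadian", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Keep only the real lookup values, then join them into one alternation pattern
lookup  <- df$Name2[!is.na(df$Name2)]
pattern <- paste(lookup, collapse = "|")

# TRUE wherever any lookup value occurs somewhere inside Name1
df$Is_Matched <- grepl(pattern, df$Name1)
print(df)
```

Because the joined string is treated as a regular expression, lookup values that contain characters such as "." or "(" would need escaping before being pasted together.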
You could use sapply. Without an example to test against, I think something like this should work; I'll check it on an example shortly.
df$Is_Matched <- sapply(df$Name2, function(x) any(grepl(x, df$Name1)))
EDIT:
Creating an example dataframe helped. sapply was returning a matrix in which each word in Name2 has its own column. So you can use rowSums to test whether any row contains a TRUE (TRUE = 1, FALSE = 0). Let me know if you have any issues with it.
> df <- data.frame(
+ Name1 = c("Acadian Hospitals", "Bridgewater Trust Associates",
+ "Concordia Consulting", "Wellington Corporation LLC",
+ "Wellington Wealth Management", "Prime Acadian Charity"),
+ Name2 = c("Wellington", "Zeus", "Acadian", NA, NA, NA),
+ stringsAsFactors = FALSE
+ )
>
> match_me <- na.omit(df$Name2)
> df$Is_Matched <- rowSums(sapply(match_me, function(x) grepl(x, df$Name1))) > 0
> df
                         Name1      Name2 Is_Matched
1            Acadian Hospitals Wellington       TRUE
2 Bridgewater Trust Associates       Zeus      FALSE
3         Concordia Consulting    Acadian      FALSE
4   Wellington Corporation LLC       <NA>       TRUE
5 Wellington Wealth Management       <NA>       TRUE
6        Prime Acadian Charity       <NA>       TRUE
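If the lookup values might contain regex metacharacters, a variant of the same sapply/rowSums approach could pass fixed = TRUE to grepl so each value is matched literally. This is only a sketch: the company names below and the variables name1, lookup, hits and is_matched are made up for illustration.

```r
# Hypothetical names; "Co. (UK)" contains characters that are special in a regex
name1  <- c("Acadian Hospitals", "Bridgewater Trust Associates",
            "Smith & Co. (UK)", "Smith and Co")
lookup <- c("Acadian", "Co. (UK)")

# One column per lookup value, one row per name; fixed = TRUE disables regex
hits <- sapply(lookup, function(x) grepl(x, name1, fixed = TRUE))

# A row is matched if at least one lookup value was found in it
is_matched <- rowSums(hits) > 0
print(is_matched)
```

Note that sapply only returns a matrix here because there are at least two lookup values; with a single value it returns a plain vector and rowSums would fail.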
With assistance from Mike H.:
Name1 = c("Bridgewater Trust Associates", "Acadian Wealth Management", "Wellington Wealth Trust", "Concordia University", "Southern Zeus College", "Parametric Modeling", "Wellington City Corporation", "Hotel Zanzibar")
Name2 = c("Acadian", "Wellington", "Zeus")
# Pad the shorter vector with NA so both columns have the same length
max.len = max(length(Name1), length(Name2))
Name1 = c(Name1, rep(NA, max.len - length(Name1)))
Name2 = c(Name2, rep(NA, max.len - length(Name2)))
# Drop the NA padding before building the pattern; paste() would otherwise add a literal "NA"
column3 <- grepl(paste(na.omit(Name2), collapse = "|"), Name1)
df <- data.frame(Name1, Name2, column3, stringsAsFactors = FALSE)
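One thing to watch when padding with NA: paste() converts NA to the literal string "NA", so an unfiltered padded vector adds "NA" as an extra alternative in the pattern. The small sketch below shows the difference; the vectors padded and names1 (including "NASA Credit Union") are invented for the demonstration.

```r
# Hypothetical lookup vector that has been padded with NA
padded <- c("Acadian", "Zeus", NA)
names1 <- c("NASA Credit Union", "Southern Zeus College", "Hotel Zanzibar")

# Unfiltered: the pattern becomes "Acadian|Zeus|NA", so "NASA" matches by accident
naive <- grepl(paste(padded, collapse = "|"), names1)

# Filtered: the pattern is "Acadian|Zeus", so only the real lookup values match
cleaned <- grepl(paste(na.omit(padded), collapse = "|"), names1)

print(data.frame(names1, naive, cleaned))
```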