Partial string match between two columns for a large dataset in R

I have two columns and want to create a binary column indicating whether there is a partial match between them.
For example:

X             Y        Match
hello         hello     1
hi hello      hi        1
NA            bye       NA
bye           hi bye    1
good          bad       0

I used the following code:

df['Match'] <- ifelse(with(df, str_detect(x, y)|str_detect(y, x)), 1, 0)

This worked for the first few rows, but when I ran it on the whole dataset (n = 14000), I kept getting this error:

Error in stri_detect_regex(string, pattern, opts_regex = opts(pattern)) :
Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)

How should I go about solving this problem?

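You probably have parentheses or other regex special characters in your data. str_detect treats its pattern argument as a regular expression, so a value containing an unmatched "(" triggers this error.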

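To find the offending row, try a loop like this:

for (i in seq_len(nrow(df))) {
  print(i)  # the last index printed before the error is the problem row
  # check both directions, as in the original ifelse() call
  str_detect(df$x[i], df$y[i])
  str_detect(df$y[i], df$x[i])
}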

The last value of i printed tells you which row causes the problem.
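
Once the cause is confirmed, one way to avoid the error altogether is to match the strings literally instead of as regular expressions. The following is a minimal sketch in base R, not necessarily what the original poster ended up using. The sample data frame and the helper name contains_literal are made up for illustration, and the last row is an invented value with an unbalanced parenthesis.

```r
# Hypothetical sample data mirroring the question, plus one row with an
# unbalanced parenthesis that would break a regex-based match.
df <- data.frame(
  x = c("hello", "hi hello", NA, "bye", "good", "hi (bye"),
  y = c("hello", "hi", "bye", "hi bye", "bad", "(bye"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if `needle` occurs literally inside `haystack`,
# NA if either value is missing. fixed = TRUE disables regex interpretation.
contains_literal <- function(haystack, needle) {
  if (is.na(haystack) || is.na(needle)) return(NA)
  grepl(needle, haystack, fixed = TRUE)
}

# Row-wise checks in both directions.
x_in_y <- mapply(contains_literal, df$y, df$x, USE.NAMES = FALSE)
y_in_x <- mapply(contains_literal, df$x, df$y, USE.NAMES = FALSE)

# 1 = partial match, 0 = no match, NA = a value was missing.
df$Match <- as.integer(x_in_y | y_in_x)
print(df)
```

If you prefer to stay with stringr, wrapping the pattern in fixed(), as in str_detect(x, fixed(y)), should have the same effect of treating the pattern as a literal string.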
