
R: Identify 1st, 2nd, 3rd, 4th match between two text strings of two different dataframes

Is there an R package that identifies the position (row index) of the 1st, 2nd, 3rd, 4th match between two text string columns of two different dataframes?

For instance:

I have the following dataframes:

dataframe: simpletext

row text
1   does he go to that bar or for shopping?
2   where was that bar that I wanted?
3   I would like to go to the opera instead for shopping


dataframe: keywords

row  word
1    shopping
2    opera
3    bar

What I want is to find that the first match of simpletext$text[1] is keywords$word[3],

the second match of simpletext$text[1] is keywords$word[1], and so on for every row of simpletext.

You might start with something like this:

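library(tidyverse)

# For one keyword, record where it first occurs in each text
# (start and end are NA where the keyword does not occur).
# tibble() and as_tibble() replace the deprecated data_frame() and as_data_frame().
find_locations <- function(word, text) {
  bind_cols(
    tibble(
      word = word,
      text = text
    ),
    as_tibble(str_locate(text, word))
  )
}

# One row per keyword/text pair
map_df(keywords$word, find_locations, text = simpletext$text)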
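The snippet above reports where each keyword occurs but does not yet say which keyword comes first, second, and so on within a text. A minimal base-R sketch of that ranking step follows. It rebuilds the question's two data frames (column names text and word as shown in the question), and rank_matches is a hypothetical helper name, not a function from any package. It also assumes keywords should be matched literally, hence fixed = TRUE.

```r
# Example data from the question, rebuilt as data frames
simpletext <- data.frame(
  text = c("does he go to that bar or for shopping?",
           "where was that bar that I wanted?",
           "I would like to go to the opera instead for shopping"),
  stringsAsFactors = FALSE
)
keywords <- data.frame(word = c("shopping", "opera", "bar"),
                       stringsAsFactors = FALSE)

# rank_matches is a hypothetical helper, not part of any package
rank_matches <- function(texts, words) {
  # every combination of text row and keyword row
  grid <- expand.grid(text_row = seq_along(texts),
                      keyword_row = seq_along(words))
  # start of the first literal occurrence, or -1 when the keyword is absent
  grid$start <- mapply(
    function(t, k) regexpr(words[k], texts[t], fixed = TRUE)[1],
    grid$text_row, grid$keyword_row
  )
  # keep only real matches, ordered by position within each text
  hits <- grid[grid$start > 0, ]
  hits <- hits[order(hits$text_row, hits$start), ]
  # 1 = first keyword to appear in that text, 2 = second, ...
  hits$match_rank <- ave(hits$start, hits$text_row, FUN = seq_along)
  hits$word <- words[hits$keyword_row]
  rownames(hits) <- NULL
  hits
}

ranked <- rank_matches(simpletext$text, keywords$word)
ranked
```

Each row of ranked pairs a text row with a keyword row; match_rank 1 is the keyword that appears earliest in that text. A keyword that occurs more than once in a text is counted only at its first occurrence.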

You can use the regexpr function (from the grep family):

keywords = rbind("shopping","opera","bar")
simpletext = rbind("does he go to that bar or for shopping?",
                   "where was that bar that I wanted?",
                   "I would like to go to the opera instead for shopping")

text_match <- function(text, keywords) {
  # check all keywords for matching
  # (indexing drops the matrix dimensions that rbind created)
  matches <- vapply(keywords[seq_along(keywords)],
                    function(x) regexpr(x, text)[1], FUN.VALUE = 1)
  # sort matched keywords in order of appearance
  sorted_matches <- names(sort(matches[matches > 0]))
  # return indices of sorted matches
  indices <- vapply(sorted_matches, function(x) which(keywords == x), FUN.VALUE = 1)
  indices
}

where regexpr(x,text)[1] returns the position of the first match of x in text, or -1 if there is none.
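To make that return value concrete, here is a short sketch on the first example sentence. The variable names are only illustrative. Note that regexpr treats the pattern as a regular expression unless fixed = TRUE is passed, which matters if a keyword contains characters such as "." or "?".

```r
sentence <- "does he go to that bar or for shopping?"

pos_bar      <- regexpr("bar", sentence)[1]       # position of the first "bar"
pos_shopping <- regexpr("shopping", sentence)[1]  # position of the first "shopping"
pos_opera    <- regexpr("opera", sentence)[1]     # -1: "opera" does not occur

# "." as a regular expression matches any character, so it matches at position 1;
# with fixed = TRUE it is a literal dot, which this sentence does not contain
pos_dot_regex   <- regexpr(".", sentence)[1]
pos_dot_literal <- regexpr(".", sentence, fixed = TRUE)[1]

c(bar = pos_bar, shopping = pos_shopping, opera = pos_opera)
```

Because "bar" starts earlier in the sentence than "shopping", sorting the positive positions yields the order of appearance that text_match relies on.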

The result of text_match for each text is as follows:

text_match(simpletext[1],keywords)
#      bar shopping 
#        3        1 
text_match(simpletext[2],keywords)
# bar 
#   3 
text_match(simpletext[3],keywords)
#    opera shopping 
#        2        1 
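Since the question asks for the 1st, 2nd, 3rd, ... match of every row, the calls above could be run over all texts at once. The sketch below is one way to do that, under a few assumptions: the inputs are stored as plain character vectors built with c() rather than with rbind, text_match is repeated in condensed form so the block runs on its own, and match_table with its match_1, match_2, ... columns is a hypothetical layout.

```r
keywords <- c("shopping", "opera", "bar")
simpletext <- c("does he go to that bar or for shopping?",
                "where was that bar that I wanted?",
                "I would like to go to the opera instead for shopping")

# condensed copy of the text_match function from the answer
text_match <- function(text, keywords) {
  matches <- vapply(keywords, function(x) regexpr(x, text)[1], FUN.VALUE = 1)
  sorted_matches <- names(sort(matches[matches > 0]))
  vapply(sorted_matches, function(x) which(keywords == x), FUN.VALUE = 1)
}

# one named vector of keyword indices per text, in order of appearance
all_matches <- lapply(simpletext, text_match, keywords = keywords)

# pad each result with NA so that column n holds the keyword index of the
# nth match; one row per text
max_n <- length(keywords)
match_table <- t(vapply(all_matches,
                        function(m) unname(m)[seq_len(max_n)],
                        FUN.VALUE = numeric(max_n)))
colnames(match_table) <- paste0("match_", seq_len(max_n))
match_table
```

In match_table, row i and column match_n give the row of keywords that is the nth match in text i, with NA where a text has fewer than n matches. This sketch assumes the keywords are unique; a duplicated keyword would make which(keywords == x) return more than one index and vapply would stop with an error.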
