Extract matches in R with grepl

Question

In R I have:

library(tidyverse)
full_names <- tibble(FIRM = c("APPLE INC.", "MICROSOFT CORPORATION", "GOOGLE", "TESLA INC.", "ABBOTT LABORATORIES"), 
                 TICKER = c("AAPL", "MSFT", "GOOGL", "TSLA", "ABT"),
                 ID = c(111, 222, 333, 444, 555)) # a dataset with full names of firms, including some IDs
abbr_names <- c("Abbott", "Apple", "Coca-Cola", "Pepsi", "Microsoft", "Tesla") # a vector with abbreviated names of firms

I want to check whether each abbreviated name occurs in the full-names dataset and, if it does, match the corresponding full_names row to the abbr_names entry, like this:

    [1]        [2]                    [3]   [4]
[1] Abbott     ABBOTT LABORATORIES    ABT   555
[2] Apple      APPLE INC.             AAPL  111
[3] Microsoft  MICROSOFT CORPORATION  MSFT  222
[4] Tesla      TESLA INC.             TSLA  444

I tried several str_extract and grepl calls, but could not make it work yet.

Answer 1

matches <- unlist(sapply(toupper(abbr_names), grep, x = full_names$FIRM, value = TRUE))

That gives you a named vector, with the (upper-cased) abbreviations as names and the matching firms as values:

names(matches)
# [1] "ABBOTT"    "APPLE"     "MICROSOFT" "TESLA"  
c(matches, use.names = FALSE)
# [1] "ABBOTT LABORATORIES"   "APPLE INC."            "MICROSOFT CORPORATION" "TESLA INC."

There are a variety of ways to put these together.

From @Oscar's comment, we get the desired output in a total of two lines of code:

matches <- unlist(sapply(toupper(abbr_names), grep, x = full_names$FIRM, value = TRUE))
tibble(ABBR_FIRM = names(matches), FIRM = matches) %>% left_join(., full_names, by = "FIRM")
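
For readers without the tidyverse installed, the same matching can be sketched in base R. This is only an illustration, not the answerer's code: the names `hits` and `result` are made up here, `fixed = TRUE` is an added assumption that the abbreviations should be treated as literal strings rather than regular expressions, and `full_names` is rebuilt as a plain data frame so the snippet runs on its own.

```r
# Data from the question, as a plain data frame (assumption: no tidyverse).
full_names <- data.frame(
  FIRM = c("APPLE INC.", "MICROSOFT CORPORATION", "GOOGLE", "TESLA INC.", "ABBOTT LABORATORIES"),
  TICKER = c("AAPL", "MSFT", "GOOGL", "TSLA", "ABT"),
  ID = c(111, 222, 333, 444, 555),
  stringsAsFactors = FALSE
)
abbr_names <- c("Abbott", "Apple", "Coca-Cola", "Pepsi", "Microsoft", "Tesla")

# For each abbreviation, collect the row indices of the firms that contain it.
# Both sides are upper-cased so the comparison ignores case.
hits <- lapply(abbr_names, function(a) {
  grep(toupper(a), toupper(full_names$FIRM), fixed = TRUE)
})

# One output row per (abbreviation, matching firm) pair;
# abbreviations with no match contribute zero rows.
result <- data.frame(
  ABBR = rep(abbr_names, lengths(hits)),
  full_names[unlist(hits), ],
  row.names = NULL,
  stringsAsFactors = FALSE
)
result
```

Unlike the two-liner above, this keeps the abbreviations in their original spelling, and an abbreviation that matches several firms would yield one row per match.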

Answer 2

How about this?

full_names$row_num <- 1:nrow(full_names)

do.call(rbind, 
        lapply(abbr_names, 
               function(x){
                 if(sum(grepl(x, full_names$FIRM, ignore.case = TRUE)) > 0){
                   row <- grepl(x, full_names$FIRM, ignore.case = TRUE) %>% 
                     which()} else {row <- 0}
                 data.frame("name" = x,
                            "row_num" = row)})) %>% 
  right_join(full_names, by = "row_num")

Answer 3

Another option might be, for example, this:

map_int(abbr_names, ~ {
  idx <- grep(., full_names$FIRM, ignore.case = TRUE)
  # NA_integer_ keeps the return type integer, as map_int() expects
  if (length(idx) == 0) NA_integer_ else idx
}) %>% 
  cbind(ABBR = abbr_names, FIRM = full_names$FIRM[.]) %>% 
  as_tibble() %>%   # as.tibble() is deprecated in favour of as_tibble()
  left_join(full_names, by = "FIRM") %>%
  complete(FIRM)

# A tibble: 4 x 5
  FIRM                  .     ABBR      TICKER    ID
  <chr>                 <chr> <chr>     <chr>  <dbl>
1 ABBOTT LABORATORIES   5     Abbott    ABT      555
2 APPLE INC.            1     Apple     AAPL     111
3 MICROSOFT CORPORATION 2     Microsoft MSFT     222
4 TESLA INC.            4     Tesla     TSLA     444

Just wanted to post it anyway :)

Answer 4

My advice: convert all the words to upper case or lower case. That makes the comparison easier for functions such as grepl.
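
A quick illustration of why case matters, using only base R (the variable names are made up for this example):

```r
firm <- "APPLE INC."

# grepl() is case-sensitive by default, so the mixed-case name is not found.
plain_match <- grepl("Apple", firm)                                  # FALSE

# Option 1: ask grepl() to ignore case.
ignore_case_match <- grepl("Apple", firm, ignore.case = TRUE)        # TRUE

# Option 2: normalise both sides to the same case before comparing.
lowered_match <- grepl(tolower("Apple"), tolower(firm), fixed = TRUE) # TRUE
```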

My code:

library(tidyverse)

full_names <- tibble(FIRM = c("APPLE INC.", "MICROSOFT CORPORATION", "GOOGLE", "TESLA INC.", "ABBOTT LABORATORIES"), 
                     TICKER = c("AAPL", "MSFT", "GOOGL", "TSLA", "ABT"),
                     ID = c(111, 222, 333, 444, 555)) # a dataset with full names of firms, including some IDs

abbr_names <- c("Abbott", "Apple", "Coca-Cola", "Microsoft", "Tesla") # a vector with abbreviated names of firms

Here I create a new column to hold the abbreviation that grepl matches for each firm:

full_names$new_column <- NA

Then I loop over the names that we want to look up in the data frame:

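for (i in seq_along(abbr_names)) {
  # match on the first four characters of the abbreviation, ignoring case
  search_test <- grepl(tolower(substr(abbr_names[i], 1, 4)), tolower(full_names$FIRM))
  position <- which(search_test)  # rows where the prefix was found
  full_names$new_column[position] <- abbr_names[i]
}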

The result is the following data frame:

  FIRM                  TICKER    ID new_column
1 APPLE INC.            AAPL     111 Apple
2 MICROSOFT CORPORATION MSFT     222 Microsoft
3 GOOGLE                GOOGL    333 NA
4 TESLA INC.            TSLA     444 Tesla
5 ABBOTT LABORATORIES   ABT      555 Abbott

No entry in the abbr_names vector starts with "Goog", so the value for GOOGLE is NA.

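The loop can also be written without explicit indexing. The sketch below is one possible variant, not the answerer's code: it matches on the whole abbreviation instead of its first four characters (a deliberate change, since a four-character prefix could also hit unrelated firms), and the names `hit` and `first_hit` are invented for the example.

```r
full_names <- data.frame(
  FIRM = c("APPLE INC.", "MICROSOFT CORPORATION", "GOOGLE", "TESLA INC.", "ABBOTT LABORATORIES"),
  TICKER = c("AAPL", "MSFT", "GOOGL", "TSLA", "ABT"),
  ID = c(111, 222, 333, 444, 555),
  stringsAsFactors = FALSE
)
abbr_names <- c("Abbott", "Apple", "Coca-Cola", "Microsoft", "Tesla")

# Logical matrix: one row per firm, one column per abbreviation.
hit <- sapply(tolower(abbr_names), grepl, x = tolower(full_names$FIRM), fixed = TRUE)

# For each firm, the index of the first abbreviation it contains, or NA if none.
first_hit <- vapply(seq_len(nrow(hit)), function(i) {
  w <- which(hit[i, ])
  if (length(w) > 0) w[[1]] else NA_integer_
}, integer(1))

# Indexing with NA yields NA, so unmatched firms get NA automatically.
full_names$new_column <- abbr_names[first_hit]
full_names
```

If a firm could contain more than one abbreviation, this keeps the first one in `abbr_names` order, whereas the loop keeps the last.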