R: extract substring with capital letters from string

I have a dataframe with strings in a column. How can I extract only the substrings that are in capital letters and add them to another column?

This is an example:

    fecha          incident
1   2020-12-01     Check GENERATOR
2   2020-12-01     Check BLADE
3   2020-12-02     Problem in GENERATOR
4   2020-12-01     Check YAW
5   2020-12-02     Alarm in SAFETY SYSTEM

And I would like to create another column as follows:

    fecha          incident                  system
1   2020-12-01     Check GENERATOR           GENERATOR
2   2020-12-01     Check BLADE               BLADE
3   2020-12-02     Problem in GENERATOR      GENERATOR
4   2020-12-01     Check YAW                 YAW
5   2020-12-02     Alarm in SAFETY SYSTEM    SAFETY SYSTEM

I have tried str_sub and str_extract_all with a regex, but I believe I'm doing things wrong.

You can use str_extract if you want to work in a dataframe and tie it into a tidyverse workflow.

The regex matches either a capital letter or a whitespace character, and requires two or more of them in a row (so it does not pick up words that are merely capitalized, like "Check"). str_trim removes the leading or trailing whitespace that gets picked up along with the upper-case words, for example when they are not at the end of the string.

Note that this code snippet only extracts the first run of upper-case words connected by spaces. If there are upper-case words in different parts of the string, only the first run is returned.

library(tidyverse)
x <- c("CAPITAL and not Capital", "one more CAP word", "MULTIPLE CAPITAL words", "CAP words NOT connected")
cap <- str_trim(str_extract(x, "([:upper:]|[:space:]){2,}"))
cap
#> [1] "CAPITAL"          "CAP"              "MULTIPLE CAPITAL" "CAP"

Created on 2021-01-08 by the reprex package (v0.3.0)
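To tie this back to the question, here is a minimal sketch that applies the same pattern inside mutate. The data frame `df` is rebuilt by hand from the example in the question, and the new column is named `system` as requested there; this is only one way to wire it into a pipeline.

```r
library(dplyr)
library(stringr)

# Example data retyped from the question
df <- data.frame(
  fecha = c("2020-12-01", "2020-12-01", "2020-12-02", "2020-12-01", "2020-12-02"),
  incident = c("Check GENERATOR", "Check BLADE", "Problem in GENERATOR",
               "Check YAW", "Alarm in SAFETY SYSTEM")
)

# First run of two or more upper-case letters or whitespace characters,
# with the surrounding whitespace trimmed off
df <- df %>%
  mutate(system = str_trim(str_extract(incident, "([:upper:]|[:space:]){2,}")))

df$system
#> [1] "GENERATOR"     "BLADE"         "GENERATOR"     "YAW"           "SAFETY SYSTEM"
```

Rows with no such run would get NA in `system`, since str_extract returns NA when nothing matches.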

library(tidyverse)

string <- data.frame(test = "does this WORK")

# str_extract_all() returns a list, so `new` becomes a list column
string$new <- str_extract_all(string$test, "[A-Z]+")

string

           test  new
1 does this WORK WORK
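One thing to watch with "[A-Z]+" on the data from the question: it also matches the single capital letter at the start of words such as "Check". The short comparison below is a sketch using two of the question's strings (the variable names `loose` and `strict` are made up for illustration) and shows how requiring at least two letters avoids that.

```r
library(stringr)

incident <- c("Check GENERATOR", "Alarm in SAFETY SYSTEM")

# One or more capitals: also picks up the leading "C" and "A"
loose <- str_extract_all(incident, "[A-Z]+")
loose[[1]]
#> [1] "C"         "GENERATOR"

# Two or more capitals: only the fully upper-case words remain
strict <- str_extract_all(incident, "[A-Z]{2,}")
strict[[1]]
#> [1] "GENERATOR"
strict[[2]]
#> [1] "SAFETY" "SYSTEM"
```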

If the upper-case words are not always next to each other, you can use str_extract_all to extract every run of two or more capital letters in a sentence and then paste them together.

sapply(stringr::str_extract_all(df$incident, '[A-Z]{2,}'),paste0, collapse = ' ')
#[1] "GENERATOR"  "BLADE"    "GENERATOR"     "YAW"     "SAFETY SYSTEM"
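If you would rather not depend on stringr, roughly the same thing can be done in base R with gregexpr and regmatches. This is a sketch, not part of the answer above: the `incident` vector is retyped from the question, and the last element is a made-up string added only to show what happens when nothing matches.

```r
incident <- c("Check GENERATOR", "Check BLADE", "Problem in GENERATOR",
              "Check YAW", "Alarm in SAFETY SYSTEM",
              "nothing to report")  # hypothetical row without upper-case words

# All runs of two or more capital letters per string (a list of character vectors)
runs <- regmatches(incident, gregexpr("[A-Z]{2,}", incident, perl = TRUE))

# Collapse each set of runs into one string; no match collapses to ""
system <- vapply(runs, paste, character(1), collapse = " ")
system
#> [1] "GENERATOR"     "BLADE"         "GENERATOR"     "YAW"           "SAFETY SYSTEM"
#> [6] ""
```

Unlike str_extract, which returns NA when there is no match, pasting an empty set of matches gives an empty string, so you may want to convert "" to NA afterwards.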

