R: Extract substring with capital letters from string

Question

I have a dataframe with strings in a column. How could I extract only the substrings that are in capital letters and add them to another column?

This is an example:

    fecha          incident
1   2020-12-01     Check GENERATOR
2   2020-12-01     Check BLADE
3   2020-12-02     Problem in GENERATOR
4   2020-12-01     Check YAW
5   2020-12-02     Alarm in SAFETY SYSTEM

And I would like to create another column as follows:

    fecha          incident                  system
1   2020-12-01     Check GENERATOR           GENERATOR
2   2020-12-01     Check BLADE               BLADE
3   2020-12-02     Problem in GENERATOR      GENERATOR
4   2020-12-01     Check YAW                 YAW
5   2020-12-02     Alarm in SAFETY SYSTEM    SAFETY SYSTEM

I have tried str_sub and str_extract_all with a regex, but I believe I'm doing things wrong.

Answer 1

You can use str_extract if you want to work in a dataframe and tie it into a tidyverse workflow.

The regex matches a run of two or more consecutive characters, each of which is either an uppercase letter or whitespace, so a word that is merely capitalized (such as "Capital") is not matched. str_trim removes the surrounding whitespace that gets picked up when the uppercase word is not at the start or end of the string.

Note that this code snippet will only extract the first run of uppercase words connected by spaces. If there are uppercase words in different parts of the string, only the first run will be returned.

library(tidyverse)
x <- c("CAPITAL and not Capital", "one more CAP word", "MULTIPLE CAPITAL words", "CAP words NOT connected")
cap <- str_trim(str_extract(x, "([:upper:]|[:space:]){2,}"))
cap
#> [1] "CAPITAL"          "CAP"              "MULTIPLE CAPITAL" "CAP"

Created on 2021-01-08 by the reprex package (v0.3.0)
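Applied to the data from the question, the same approach might look like the sketch below. The data frame name `df` is an assumption (the question does not name it), and the new column is added with plain `$` assignment rather than a dplyr verb.

```r
library(stringr)

# Data from the question; the name `df` is assumed for illustration
df <- data.frame(
  fecha = c("2020-12-01", "2020-12-01", "2020-12-02", "2020-12-01", "2020-12-02"),
  incident = c("Check GENERATOR", "Check BLADE", "Problem in GENERATOR",
               "Check YAW", "Alarm in SAFETY SYSTEM")
)

# First run of 2+ uppercase letters or whitespace, with surrounding whitespace trimmed
df$system <- str_trim(str_extract(df$incident, "([:upper:]|[:space:]){2,}"))
df$system
```

Because "SAFETY SYSTEM" is a single run joined by a space, it is returned whole for the last row.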

Answer 2

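library(tidyverse)

# One-row example data frame
string <- data.frame(test = "does this WORK")

# str_extract_all returns a list, so `new` becomes a list column
string$new <- str_extract_all(string$test, "[A-Z]+")

string
#>             test  new
#> 1 does this WORK WORK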
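One caveat worth checking before using this on the question's data: `[A-Z]+` also matches a single uppercase letter, so the capital that starts each sentence is extracted too. A minimal sketch, using two strings from the question (the variable names `incident`, `loose` and `strict` are only illustrative):

```r
library(stringr)

incident <- c("Check GENERATOR", "Alarm in SAFETY SYSTEM")

# "+" allows a run of length one, so "C" and "A" are matched as well
loose <- str_extract_all(incident, "[A-Z]+")

# Requiring at least two letters skips the sentence-initial capitals
strict <- str_extract_all(incident, "[A-Z]{2,}")
```

Both results are lists with one character vector per input string, so they still need to be collapsed before they can be stored as an ordinary character column.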

Answer 3

If there are cases where the upper-case words are not next to each other, you can use str_extract_all to extract every run of two or more capital letters in a sentence and then paste them together.

sapply(stringr::str_extract_all(df$incident, '[A-Z]{2,}'), paste0, collapse = ' ')
#[1] "GENERATOR"     "BLADE"         "GENERATOR"     "YAW"           "SAFETY SYSTEM"
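For readers who would rather not depend on stringr, roughly the same result can be obtained with base R's regmatches and gregexpr. This is a sketch and not part of the original answer; `df` is rebuilt here from the question's incident column, and `runs` is an illustrative name.

```r
# Data from the question (only the column needed here)
df <- data.frame(
  incident = c("Check GENERATOR", "Check BLADE", "Problem in GENERATOR",
               "Check YAW", "Alarm in SAFETY SYSTEM")
)

# gregexpr locates every run of 2+ uppercase ASCII letters; regmatches extracts them
runs <- regmatches(df$incident, gregexpr("[A-Z]{2,}", df$incident, perl = TRUE))

# Join the runs found in each row with a single space
df$system <- vapply(runs, paste, character(1), collapse = " ")
```

A row with no uppercase run ends up as an empty string. As with the stringr version, runs that are far apart in the sentence are still joined with one space.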
