R: extract substring with capital letters from string
I have a dataframe with strings in a column. How could I extract only the substrings that are in capital letters and add them to another column?
This is an example:
       fecha               incident
1 2020-12-01        Check GENERATOR
2 2020-12-01            Check BLADE
3 2020-12-02   Problem in GENERATOR
4 2020-12-01              Check YAW
5 2020-12-02 Alarm in SAFETY SYSTEM
And I would like to create another column as follows:
       fecha               incident        system
1 2020-12-01        Check GENERATOR     GENERATOR
2 2020-12-01            Check BLADE         BLADE
3 2020-12-02   Problem in GENERATOR     GENERATOR
4 2020-12-01              Check YAW           YAW
5 2020-12-02 Alarm in SAFETY SYSTEM SAFETY SYSTEM
I have tried str_sub and str_extract_all with a regex, but I believe I'm doing things wrong.
You can use str_extract if you want to work in a dataframe and tie it into a tidyverse workflow.
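The regex matches a run of two or more consecutive characters, each of which is either an upper-case letter or whitespace, so a capitalized word at the start of a string (one capital followed by lower-case letters) is not matched. Be aware that a space followed by a single capital also counts as two such characters, so a capitalized word in the middle of a string can still produce a short match such as " C". str_trim removes the whitespace that gets picked up around the match, for example when the upper-case word is not at the end of the string.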
Note that this code snippet will only extract the first run of upper-case words connected by spaces. If there are upper-case words in different parts of the string, only the first one will be returned.
library(tidyverse)
x <- c("CAPITAL and not Capital", "one more CAP word", "MULTIPLE CAPITAL words", "CAP words NOT connected")
cap <- str_trim(str_extract(x, "([:upper:]|[:space:]){2,}"))
cap
#> [1] "CAPITAL" "CAP" "MULTIPLE CAPITAL" "CAP"
Created on 2021-01-08 by the reprex package (v0.3.0)
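To tie this into the data from the question, the same pattern can be applied inside mutate(). This is only a minimal sketch: the data frame `df` is rebuilt by hand here (its name and construction are my assumption, not taken from the original post), and fecha is kept as plain strings for simplicity.

```r
library(dplyr)
library(stringr)

# Hypothetical reconstruction of the data frame shown in the question
df <- data.frame(
  fecha = c("2020-12-01", "2020-12-01", "2020-12-02", "2020-12-01", "2020-12-02"),
  incident = c("Check GENERATOR", "Check BLADE", "Problem in GENERATOR",
               "Check YAW", "Alarm in SAFETY SYSTEM")
)

# Extract the first run of upper-case letters/whitespace, then trim the padding
df <- df %>%
  mutate(system = str_trim(str_extract(incident, "([:upper:]|[:space:]){2,}")))

df$system
#> [1] "GENERATOR"     "BLADE"         "GENERATOR"     "YAW"           "SAFETY SYSTEM"
```

Because str_extract returns only the first match, a row with upper-case words in several separate places would keep just the first run.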
library(tidyverse)
string <- data.frame(test = "does this WORK")
# str_extract_all() returns a list, so `new` is stored as a list column
string$new <- str_extract_all(string$test, "[A-Z]+")
string
#>             test  new
#> 1 does this WORK WORK
If there are cases where the upper-case words are not next to each other, you can use str_extract_all to extract every run of two or more capital letters in a sentence and then paste them together.
# df is the data frame from the question
sapply(stringr::str_extract_all(df$incident, "[A-Z]{2,}"), paste0, collapse = " ")
#> [1] "GENERATOR" "BLADE" "GENERATOR" "YAW" "SAFETY SYSTEM"
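For reference, roughly the same result can be had without stringr. A minimal base R sketch, assuming the incident strings from the question are held in a plain vector called `incident` (the vector names are chosen here for illustration only). perl = TRUE is used because character ranges such as [A-Z] can be locale-dependent in R's default regex engine.

```r
# Incident strings copied from the question; the vector name is illustrative
incident <- c("Check GENERATOR", "Check BLADE", "Problem in GENERATOR",
              "Check YAW", "Alarm in SAFETY SYSTEM")

# Find every run of two or more upper-case ASCII letters in each string
matches <- regmatches(incident, gregexpr("[A-Z]{2,}", incident, perl = TRUE))

# Join the runs found in each string with a single space
systems <- vapply(matches, paste, character(1), collapse = " ")
systems
#> [1] "GENERATOR"     "BLADE"         "GENERATOR"     "YAW"           "SAFETY SYSTEM"
```

With this approach a string that contains no upper-case run yields an empty string "" rather than NA, which may need handling before the result is stored in a column.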