
Extracting a specific word using gsub and regex

Following up on a previous question, I'm having trouble with the proper regular expression syntax to isolate a specific word.

Given a data frame:

DL<-c("Dark_ark","Light-Lis","dark7","DK_dark","The_light","Lights","Lig_dark","D_Light")
Col1<-c(1,12,3,6,4,8,2,8)
DF<-data.frame(Col1)
row.names(DF)<-DL

I'm looking to extract every "Dark" and "Light" (ignoring upper vs. lower case) from the row names and make a second column containing only the string "Dark" or "Light":

Col2<-c("Dark","Light","dark","dark","light","Light","dark","Light")
DF$Col2<-Col2

          Col1  Col2
Dark_ark     1  Dark
Light-Lis   12 Light
dark7        3  dark
DK_dark      6  dark
The_light    4 light
Lights       8 Light
Lig_dark     2  dark
D_Light      8 Light

I've changed the original data a bit to illustrate my current issue. Working from an excellent answer by Tyler Rinker, I used this:

DF$Col2<-gsub("[^dark|light]", "", row.names(DF), ignore.case = TRUE)

But gsub gets tripped up by letters that the row names have in common with the two words. Searching the message boards for how to isolate an exact word with a regex, it looks like the answer should be to use double backslashes with either

\\<light\\>

or

\\blight\\b

So why does the line

DF$Col2<-gsub("[^\\<dark\\>|\\<light\\>]", "", row.names(DF), ignore.case = TRUE)

not produce the desired column above? Instead I get

          Col1    Col2
Dark_ark     1 Darkark
Light-Lis   12 LightLi
dark7        3    dark
DK_dark      6  DKdark
The_light    4 Thlight
Lights       8   Light
Lig_dark     2 Ligdark
D_Light      8  DLight

How about this?

unlist(regmatches(rownames(DF), gregexpr("dark|light", rownames(DF), ignore.case=TRUE)))
# [1] "Dark"  "Light" "dark"  "dark"  "light" "Light" "dark"  "Light"
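
To get from this vector to the second column the question asks for, one possibility (a sketch, not part of the original answer) is to use regexpr instead of gregexpr, so that each row name contributes at most one match and a name without any match becomes NA instead of shifting the results. The name m is purely illustrative.

```r
# Rebuild the example data frame from the question
DL <- c("Dark_ark", "Light-Lis", "dark7", "DK_dark",
        "The_light", "Lights", "Lig_dark", "D_Light")
DF <- data.frame(Col1 = c(1, 12, 3, 6, 4, 8, 2, 8), row.names = DL)

# regexpr() finds the first match in each element (-1 means no match)
m <- regexpr("dark|light", rownames(DF), ignore.case = TRUE)

# Start with NA, then fill in only the rows that actually matched;
# regmatches() returns one string per matching element, in order
DF$Col2 <- NA_character_
DF$Col2[m > 0] <- regmatches(rownames(DF), m)

DF$Col2
# [1] "Dark"  "Light" "dark"  "dark"  "light" "Light" "dark"  "Light"
```

With unlist(gregexpr(...)), a row name containing both words, or neither, would yield a vector whose length differs from the number of rows, so the assignment to DF$Col2 would fail or misalign.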

or

gsub(".*(dark|light).*$", "\\1", row.names(DF), ignore.case = TRUE)
# [1] "Dark"  "Light" "dark"  "dark"  "light" "Light" "dark"  "Light"
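
As for why the attempts in the question fail, the sketch below illustrates two points using base R only. First, square brackets define a set of single characters, not a list of alternative words, so "[^dark|light]" deletes every character outside that set. Second, word-boundary markers would not help with these names, because an underscore or a digit counts as a word character. The names x, a, b and w are illustrative, and perl = TRUE is an assumption made here so that the boundary rules are the PCRE ones.

```r
x <- c("Dark_ark", "Light-Lis", "dark7", "DK_dark",
       "The_light", "Lights", "Lig_dark", "D_Light")

# 1. A negated bracket expression is a character set: the two patterns
#    below list the same characters, so they give the same result.
a <- gsub("[^dark|light]", "", x, ignore.case = TRUE)
b <- gsub("[^adghiklrt|]", "", x, ignore.case = TRUE)
identical(a, b)
# [1] TRUE
a
# [1] "Darkark" "LightLi" "dark"    "DKdark"  "Thlight" "Light"   "Ligdark" "DLight"

# 2. \\b only matches between a word character and a non-word character.
#    "_", digits and letters are all word characters, so only "Light-Lis"
#    contains "light" as a whole word.
w <- grepl("\\b(dark|light)\\b", x, ignore.case = TRUE, perl = TRUE)
w
# [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
```

This is why the answers here match the plain alternation dark|light, outside of any brackets and without boundary markers.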

One option is to use the stringr package:

library(stringr)
str_extract(tolower(rownames(DF)), 'dark|light')
# [1] "dark"  "light" "dark"  "dark"  "light" "light" "dark"  "light"

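Or better, using @Arun's suggestion, which keeps the original case (ignore.case() comes from older stringr releases; since stringr 1.0 the equivalent is regex('dark|light', ignore_case = TRUE)):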

str_extract(rownames(DF), ignore.case('dark|light'))
