How to clean up a data frame column with a regular expression?
Consider this data frame:
df <- data.frame(Index=c(1:4),
Perc1=c("SC(23.43%","12.21%","","(18.44%"))
Index Perc1
1 1 SC(23.43%
2 2 12.21%
3 3
4 4 (18.44%
The goal is to clean up its column Perc1 with regex.
Desired result:
Index Perc1
1 1 0.2343
2 2 0.1221
3 3
4 4 0.1844
I tried the following code, but I get an error and a wrong result.
pattern <- ".*([0-9]+.[0-9]{2})%"
ind <- grep(pattern, df$Perc1, value = FALSE)
df$Perc1 <- sub(pattern, "\\1", df$Perc1)
df$Perc1[-ind] <- NA
df$Perc1 <- as.numeric(df$perc1)/100
You can use readr::parse_number to get the number from Perc1 directly.
transform(df, Perc1 = readr::parse_number(Perc1)/100)
#  Index Perc1
#1 1 0.2343
#2 2 0.1221
#3 3 NA
#4 4 0.1844
You can use regexpr and regmatches to extract the numbers.
r <- regexpr("\\d*\\.?\\d*(?=%)", df$Perc1, perl=TRUE)
df$Perc1 <- as.numeric(`[<-`(rep(NA, length(r)), r!=-1, regmatches(df$Perc1, r))) / 100
df
# Index Perc1
#1 1 0.2343
#2 2 0.1221
#3 3 NA
#4 4 0.1844
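The one-liner above packs several steps into a single call, so here is a rough step-by-step sketch of the same idea. It uses the same pattern and the same sample values; the names x, m and out are only for illustration.

```r
# Sample values copied from the question
x <- c("SC(23.43%", "12.21%", "", "(18.44%")

# regexpr() returns the start position of the first match per element,
# or -1 where nothing matches; (?=%) is a lookahead, so perl = TRUE is needed
m <- regexpr("\\d*\\.?\\d*(?=%)", x, perl = TRUE)

# Start from all-NA, then fill only the positions that had a match;
# regmatches() returns the matched substrings for those positions only
out <- rep(NA_real_, length(x))
out[m != -1] <- as.numeric(regmatches(x, m)) / 100
out
```

The elements without a match (here the empty string) stay NA, which is what the `[<-` call in the one-liner achieves in a more compact form.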
And with your approach:
pattern <- ".*?([0-9]+.[0-9]{2})%" #Adding ? after *
ind <- grepl(pattern, df$Perc1) #Change to grepl to get logical vector
df$Perc1 <- sub(pattern, "\\1", df$Perc1)
df$Perc1[!ind] <- NA #Invert the logical vector
df$Perc1 <- as.numeric(df$Perc1)/100 #There was a typo perc1 instead of Perc1
df
# Index Perc1
#1 1 0.2343
#2 2 0.1221
#3 3 NA
#4 4 0.1844
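To see why the ? matters, here is a small comparison of the greedy and the lazy pattern on the same values. This is only a sketch: it is run with perl = TRUE so that the backtracking behaviour is PCRE's (the code above uses the default engine), and the names x, greedy and lazy are made up for the example.

```r
# Sample values copied from the question
x <- c("SC(23.43%", "12.21%", "", "(18.44%")

# Greedy .* grabs as much as it can, leaving only one digit for [0-9]+
greedy <- sub(".*([0-9]+.[0-9]{2})%", "\\1", x, perl = TRUE)

# Lazy .*? grabs as little as it can, so [0-9]+ gets all leading digits
lazy <- sub(".*?([0-9]+.[0-9]{2})%", "\\1", x, perl = TRUE)

greedy  # "3.43"  "2.21"  ""  "8.44"
lazy    # "23.43" "12.21" ""  "18.44"

# The error in the question comes from the typo: the column perc1 does not exist
df <- data.frame(Index = 1:4, Perc1 = x)
is.null(df$perc1)  # TRUE
```

Note that the empty string is returned unchanged by sub() because nothing matches, which is why the NA step is still needed afterwards.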
You can use str_extract and convert the extracted digits to numeric:
library(stringr)
df$Perc1 <- as.numeric(str_extract(df$Perc1, "\\d\\d\\.\\d\\d"))/100
Result:
df
Index Perc1
1 1 0.2343
2 2 0.1221
3 3 NA
4 4 0.1844
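One caveat: the pattern "\\d\\d\\.\\d\\d" expects exactly two digits on each side of the decimal point, which fits the sample data but may not fit other percentages. A possible, more permissive variant is sketched below; the extra values "5.5%" and "100%" are hypothetical and not from the question, and the lookahead pattern is just one option.

```r
library(stringr)

# Hypothetical inputs: two of them do not have the dd.dd shape
x <- c("5.5%", "100%", "SC(23.43%")

# Fixed-width pattern: only matches two digits, a dot, two digits
fixed <- str_extract(x, "\\d\\d\\.\\d\\d")

# Permissive pattern: one or more digits, optional decimals, followed by %
flexible <- str_extract(x, "\\d+(\\.\\d+)?(?=%)")

fixed     # NA    NA    "23.43"
flexible  # "5.5" "100" "23.43"

as.numeric(flexible) / 100
```

If your data always looks like the sample in the question, the fixed-width pattern is fine as it is.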