How to clean up a dataframe column with regex?

Question

Consider this dataframe:

df <- data.frame(Index=c(1:4),
                  Perc1=c("SC(23.43%","12.21%","","(18.44%"))
  Index     Perc1
1     1 SC(23.43%
2     2    12.21%
3     3          
4     4   (18.44%

The goal is to clean up its column Perc1 with regex.

Desired result:

  Index  Perc1
1     1 0.2343
2     2 0.1221
3     3       
4     4 0.1844

I tried the following code, but I get an error and a wrong result.

pattern <- ".*([0-9]+.[0-9]{2})%"
ind <- grep(pattern, df$Perc1, value = FALSE)
df$Perc1 <- sub(pattern, "\\1", df$Perc1)
df$Perc1[-ind] <- NA
df$Perc1 <- as.numeric(df$perc1)/100

Answer 1

You can use readr::parse_number to get the number from Perc1 directly.

transform(df, Perc1 = readr::parse_number(Perc1)/100)

#  Index  Perc1
#1     1 0.2343
#2     2 0.1221
#3     3     NA
#4     4 0.1844

Answer 2

You can use regexpr and regmatches to extract the numbers.

r <- regexpr("\\d*\\.?\\d*(?=%)", df$Perc1, perl=TRUE)
df$Perc1 <- as.numeric(`[<-`(rep(NA, length(r)), r!=-1, regmatches(df$Perc1, r))) / 100
df
#  Index  Perc1
#1     1 0.2343
#2     2 0.1221
#3     3     NA
#4     4 0.1844
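
The one-liner above packs several steps together. Here is a more explicit sketch of the same idea, using only base R. The helper name extract_percent is made up for illustration, and the pattern is slightly stricter than the one above: it requires at least one digit before the %.

```r
# Hypothetical helper (not from the answer): return the number in front of "%"
# as a fraction, or NA where no such number is found.
extract_percent <- function(x) {
  x <- as.character(x)                      # also covers factor columns
  # Locate the first run of digits, with an optional decimal part, followed by "%"
  m <- regexpr("\\d+(\\.\\d+)?(?=%)", x, perl = TRUE)
  out <- rep(NA_real_, length(x))           # entries without a match stay NA
  out[m != -1] <- as.numeric(regmatches(x, m)) / 100
  out
}

df <- data.frame(Index = 1:4,
                 Perc1 = c("SC(23.43%", "12.21%", "", "(18.44%"))
df$Perc1 <- extract_percent(df$Perc1)
df
```

regmatches returns only the substrings that matched, so they are assigned to the positions where regexpr did not return -1.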

And with your approach:

pattern <- ".*?([0-9]+\\.[0-9]{2})%"  # Lazy .*? so the group keeps all leading digits; dot escaped
ind <- grepl(pattern, df$Perc1)       # grepl returns a logical vector
df$Perc1 <- sub(pattern, "\\1", df$Perc1)
df$Perc1[!ind] <- NA                  # Set rows without a match to NA
df$Perc1 <- as.numeric(df$Perc1)/100  # Fixed typo: perc1 instead of Perc1
df
#  Index  Perc1
#1     1 0.2343
#2     2 0.1221
#3     3     NA
#4     4 0.1844
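
To see why the ? after .* matters, here is a small sketch on one value from the question. It assumes perl = TRUE, which is an addition here and not part of the answer, so that the matching follows PCRE rules.

```r
x <- "SC(23.43%"

# Greedy: ".*" takes as much as it can ("SC(2"), so the group captures only "3.43",
# which would become 0.0343 after dividing by 100
greedy <- sub(".*([0-9]+\\.[0-9]{2})%", "\\1", x, perl = TRUE)

# Lazy: ".*?" takes as little as it can ("SC("), so the group captures "23.43"
lazy <- sub(".*?([0-9]+\\.[0-9]{2})%", "\\1", x, perl = TRUE)

print(c(greedy = greedy, lazy = lazy))
```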

Answer 3

You can use str_extract and convert the matched digits to numeric:

library(stringr)
df$Perc1 <- as.numeric(str_extract(df$Perc1, "\\d\\d\\.\\d\\d"))/100

Result:

df
  Index  Perc1
1     1 0.2343
2     2 0.1221
3     3     NA
4     4 0.1844

3 answers

Answer 1
2, accepted, 2021-04-28 06:49:34

Answer 2
1, 2021-04-28 07:24:27

Answer 3
1, 2021-04-28 07:57:43
