
R: Converting data frame of percentages from factor to numeric

I'm running into issues converting a data frame in R.

I have a bunch of columns that were read in as factors and contain % symbols.

I know that for a single column I could do:

df[,3] <- as.numeric(sub("%","",df[,3]))

But applying this to the whole dataset does not work: it changes all the values to NA. What am I doing wrong? Here is the code I tried:

df[,-1] <- as.numeric(sub("%","",df[,-1]))

EDIT: I know I can solve this with:

for (i in 2:66) {
df[,i] <- as.numeric(sub("%","",df[,i]))
print(class(df[,i]))
}

But there has to be a more elegant (and hopefully one-line) way to do this.

EDIT 2: Here is some of the data:

    Year        v1      v2       v3       v4
1 12-Oct        0%      0%      39%      14%
2 12-Nov        0%      6%      59%       4%
3 12-Dec       22%      0%      37%      26%
4 13-Jan       45%      0%      66%      19%
5 13-Feb       28%     39%      74%      13%

ANSWERED: Here is how I did it in one command after you all helped me so much! I was having problems with specifying the function part.

df=read.csv("all response rates.csv")
df[-1]<-data.frame(apply(df[-1], 2, function(x) 
    as.numeric(sub("%","",as.character(x)))))

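parse_number from the readr package will remove the % symbols. It expects character input, so the code below wraps each column in as.character in case the columns were read as factors. For your given data set, try:

library(dplyr)
library(readr)

res <- cbind(df %>% select(Year), # preserve the year column as-is
             df %>% select(-Year) %>% mutate_all(~ parse_number(as.character(.x)))
             )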

> res
    Year v1 v2 v3 v4
1 12-Oct  0  0 39 14
2 12-Nov  0  6 59  4
3 12-Dec 22  0 37 26
4 13-Jan 45  0 66 19
5 13-Feb 28 39 74 13

If you don't need to preserve your first column, you only need this part:

df %>% select(-Year) %>% mutate_all(~ parse_number(as.character(.x)))

Here is an option using set from data.table, which would be faster for big datasets because the overhead of [.data.table is avoided:

library(stringi)
library(data.table)

setDT(df)
for(j in 2:ncol(df)){
     set(df, i=NULL, j=j, value= as.numeric(stri_extract(df[[j]], regex='\\d+')))
}

df
#     Year v1 v2 v3 v4
#1: 12-Oct  0  0 39 14
#2: 12-Nov  0  6 59  4
#3: 12-Dec 22  0 37 26
#4: 13-Jan 45  0 66 19
#5: 13-Feb 28 39 74 13
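
One caveat worth checking before relying on the '\\d+' pattern: it keeps only the first run of digits, so decimal or negative percentages would be truncated. The sketch below illustrates this in base R, with regmatches/regexpr standing in for stri_extract (an assumption made so the example runs without extra packages). The sample vector x is made up for illustration and is not from the question's data.

```r
# Hypothetical sample values, not taken from the question's data
x <- c("0%", "12.5%", "-3%")

# Keep only the first run of digits, analogous to the '\\d+' pattern above
first_digits <- as.numeric(regmatches(x, regexpr("[0-9]+", x)))

# Removing just the % sign keeps the minus sign and the decimal part
stripped <- as.numeric(sub("%", "", x, fixed = TRUE))

print(first_digits)
print(stripped)
```

For whole-number, non-negative percentages like those in the question, both approaches should give the same values.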

Try this approach using functions from base R:

# dummy data:
df <- data.frame(v1=c("78%", "65%", "32%"), v2=c("43%", "56%", "23%"))

# strip the % sign from every column, then convert each column to numeric
df2 <- data.frame(lapply(df, function(x) as.numeric(sub("%", "", x))))

As per the comments, this first strips the percentage signs and then converts the columns from factor to numeric. I've changed the original answer from apply to lapply following @thelatemail's suggestion.
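
To see why the whole-data-frame attempt in the question returns NA while the column-wise version works, here is a minimal sketch on a rebuilt copy of the question's sample data (the name failed is only illustrative). sub() coerces its input with as.character(), and for a data frame that produces one deparsed string per column rather than the individual cell values, so as.numeric() cannot parse any of them.

```r
# Rebuild the sample data from the question, with factor columns
df <- data.frame(
  Year = c("12-Oct", "12-Nov", "12-Dec", "13-Jan", "13-Feb"),
  v1 = c("0%", "0%", "22%", "45%", "28%"),
  v2 = c("0%", "6%", "0%", "0%", "39%"),
  v3 = c("39%", "59%", "37%", "66%", "74%"),
  v4 = c("14%", "4%", "26%", "19%", "13%"),
  stringsAsFactors = TRUE
)

# The question's attempt: sub() sees one string per column, not the cells,
# so every element becomes NA (the coercion warning is suppressed here)
failed <- suppressWarnings(as.numeric(sub("%", "", df[, -1])))
print(failed)

# Column-wise version: lapply() hands each column to sub() separately,
# and assigning into df[-1] leaves the Year column untouched
df[-1] <- lapply(df[-1], function(x) as.numeric(sub("%", "", x, fixed = TRUE)))
str(df)
```

Assigning into df[-1] rather than building a new data frame keeps the first column and the column names as they were.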

Here is a one-line solution that assumes the data is in fixed-width columns. I needed to remove the first row of names since not all the columns had names. The column widths are specified as integers (a negative width means skip that many characters). The columns are also read as numeric via colClasses.

Your data (header row removed):

1 12-Oct        0%      0%      39%      14%
2 12-Nov        0%      6%      59%       4%
3 12-Dec       22%      0%      37%      26%
4 13-Jan       45%      0%      66%      19%
5 13-Feb       28%     39%      74%      13%

The R one-line script:

adf <- read.fwf(file="a.dat",widths=c(-8,9,-1,7,-1,8,-1,8),colClasses=rep("numeric",4))

Output (the first column is the row numbers printed by R):

  V1 V2 V3 V4
1  0  0 39 14
2  0  6 59  4
3 22  0 37 26
4 45  0 66 19
5 28 39 74 13
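
A self-contained way to try the fixed-width approach, as a sketch: the sample rows are written to a temporary file (used here in place of "a.dat", which is an assumption for the sake of a runnable example) and read back with the same widths. The negative widths skip the row label, the date, and each % sign, so only the digits reach the numeric conversion.

```r
# The five data rows from the question, without the header line
rows <- c(
  "1 12-Oct        0%      0%      39%      14%",
  "2 12-Nov        0%      6%      59%       4%",
  "3 12-Dec       22%      0%      37%      26%",
  "4 13-Jan       45%      0%      66%      19%",
  "5 13-Feb       28%     39%      74%      13%"
)

# Temporary file standing in for "a.dat"
path <- tempfile(fileext = ".dat")
writeLines(rows, path)

# -8 skips the row label and date; each -1 skips a % sign
adf <- read.fwf(file = path,
                widths = c(-8, 9, -1, 7, -1, 8, -1, 8),
                colClasses = rep("numeric", 4))
print(adf)

unlink(path)
```

The widths are tied to this exact layout, so they would need adjusting if the column spacing in the real file differs.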

