R Custom Functions to Clean Data

I'm trying to write a custom R script to help me clean up data before I do a bunch of fun stuff to it. A lot of columns in my current data set have yes/no values, and I figured they would be easier to look through as binary 1/0 values. This data set has 10 such columns, and while doing this ten times does work:

sd$PhoneService<-ifelse(sd$PhoneService=='Yes', 1,0)

it isn't easily repeatable. It's doable for this particular project, but there has to be a way to handle a dataset with 100 columns that need converting. I can't just look at the number of levels, because other columns have two levels that don't make sense as binary. So I need a way to have R go through the table, find columns that have just two levels, check that those two levels are "Yes" and "No", then convert them to 1's and 0's.

This is what I have tried:

#Get source data
sd = read.csv("source/xyz.csv", header = T, stringsAsFactors=T)

#Clean up data
twoLevelClean <- function(b){
  lvlsNames = levels(b)
  ifelse(lvlsNames == "Yes", print(lvlsNames), print("Not yes no"))
}

cleanData <- function(a){
  lvls = nlevels(a)
  ifelse(lvls == 2, sapply(a, twoLevelClean), print("Not 2"))
}

sapply(sd, cleanData)

This just starts spitting out random outputs like this:

...
[1] "No"  "Yes"
[1] "Not yes no"
[1] "No"  "Yes"
[1] "Not yes no"
[1] "No"  "Yes"
[1] "Not yes no"
[1] "No"  "Yes"
[1] "Not yes no"
...

I think it's running off the first column that has 1000+ unique values, but has more than 2 levels. I'm also not sure I'm going at this the right way. Should I even be looking at levels first? I want the twoLevelClean function to just run on the column that triggered it, but I don't think that's happening. I think it is starting back at the beginning.

Would a for statement be better for this? Can I index the columns and run certain functions on certain columns?
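
A likely explanation for the repeated output, assuming the columns are factors (as stringsAsFactors=T would produce): sapply(a, twoLevelClean) iterates over the individual elements of one column rather than over columns, and each element is a length-one factor that still carries both levels. ifelse() then evaluates both print() calls once per row. The vector col below is a made-up stand-in for one column of sd.

```r
# Hypothetical two-level column standing in for one column of sd
col <- factor(c("Yes", "No", "Yes"))

# Iterating over a factor visits each element separately;
# every element is a length-one factor that keeps all levels
per_element_levels <- lapply(col, levels)
print(per_element_levels[[1]])

# Checking the levels once per column avoids the per-row repetition
is_yes_no <- is.factor(col) && setequal(levels(col), c("Yes", "No"))
print(is_yes_no)
```

So the level check belongs at the column level, and only the conversion itself needs to touch the individual values.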

Using the tidyverse on your original dataset, you could run the following code. First, some example data:

Original_data_frame <- data.frame(
    c(1:10),
    c(rep("Yes",5),rep("No",5)),
    c(rep("Yes",5),rep("No",5))
)

names(Original_data_frame ) <- c("id","Var1","Var2")

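Using the mutate_at function from the dplyr package:

library(dplyr)

# funs() is deprecated in current dplyr, so a formula lambda is used instead
Original_data_frame_mod <- Original_data_frame %>% 
    mutate_at(.vars = vars(Var1, Var2), .funs = ~ ifelse(. == "Yes", 1, 0))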

Here's how you could do it:

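yn_to_10 = function(x) {
    # Leave non-factor columns untouched
    if (! is.factor(x)) return(x)
    # Leave factors whose levels are not exactly "No" and "Yes"
    # (capitalised as in the question's data) untouched
    if (! identical(levels(x), c("No", "Yes"))) return(x)
    return(ifelse(x == "Yes", 1, 0))
}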

your_data[] = lapply(your_data, yn_to_10)
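
On the question of indexing columns: one base R sketch is to build a logical index of the qualifying columns first and then convert only those. The data frame toy and its column names are made up for illustration, and the check assumes capitalised "Yes"/"No" labels as in the question's output.

```r
# Hypothetical data: two yes/no columns and one two-level column that is not yes/no
toy <- data.frame(
  id           = 1:4,
  PhoneService = factor(c("Yes", "No", "No", "Yes")),
  Gender       = factor(c("F", "M", "M", "F")),
  Partner      = factor(c("No", "No", "Yes", "Yes"))
)

# TRUE for factor columns whose levels are exactly "Yes" and "No"
is_yes_no <- vapply(
  toy,
  function(col) is.factor(col) && setequal(levels(col), c("Yes", "No")),
  logical(1)
)

# Convert only the selected columns; everything else is left untouched
toy[is_yes_no] <- lapply(toy[is_yes_no], function(col) as.integer(col == "Yes"))

str(toy)
```

Keeping the index in is_yes_no also gives you a record of which columns were converted, which can be handy when checking the result.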

But you should listen to the comments - factors are internally stored as integers (starting with 1, not 0), so changing a two-level factor to binary 0/1 doesn't really change very much.
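
To illustrate the point about integer storage, here is a small sketch with a made-up factor x. The underlying codes follow the order of the levels, so when the levels are "No" then "Yes", subtracting 1 from the codes gives the same 0/1 coding.

```r
# Hypothetical yes/no column
x <- factor(c("Yes", "No", "Yes"))

# Levels are sorted alphabetically by default
print(levels(x))

# Underlying integer codes: position of each value in levels(x)
codes <- as.integer(x)
print(codes)

# Shifting the codes by one gives a 0/1 coding
binary <- as.integer(x) - 1L
print(binary)
```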
