R Custom Functions to Clean Data
I'm trying to write a custom R script to help me clean up data before I do a bunch of fun stuff to it. A lot of columns in my current data set have yes/no values, and I figured they would be easier to look through if I made them binary 1/0 values. This set has 10 such columns, and while doing this ten times does work:
sd$PhoneService<-ifelse(sd$PhoneService=='Yes', 1,0)
it isn't easily repeatable. It's doable for this particular project, but there has to be a way to handle a dataset with 100 columns that need converting. I can't just look at the number of levels, because other columns have two levels that don't make as much sense as binary values. So I need a way to have R go through the table, find columns that have just two levels, check that those two levels are "yes" and "no", then convert them to 1's and 0's.
This is what I have tried:
#Get source data
sd = read.csv("source/xyz.csv", header = T, stringsAsFactors=T)
#Clean up data
twoLevelClean <- function(b){
  lvlsNames = levels(b)
  ifelse(lvlsNames == "Yes", print(lvlsNames), print("Not yes no"))
}
cleanData <- function(a){
  lvls = nlevels(a)
  ifelse(lvls == 2, sapply(a, twoLevelClean), print("Not 2"))
}
sapply(sd, cleanData)
This just starts spitting out seemingly random output like this:
...
[1] "No" "Yes"
[1] "Not yes no"
[1] "No" "Yes"
[1] "Not yes no"
[1] "No" "Yes"
[1] "Not yes no"
[1] "No" "Yes"
[1] "Not yes no"
...
I think it's running off the first column, which has 1000+ unique values and more than 2 levels. I'm also not sure I'm going about this the right way. Should I even be looking at levels first? I want the twoLevelClean function to run only on the column that triggered it, but I don't think that's happening. I think it is starting back at the beginning.
Would a for statement be better for this? Can I index the columns and run certain functions on certain columns?
Using the tidyverse package on your original dataset, you may run the following code:
Original_data_frame <- data.frame(
  c(1:10),
  c(rep("Yes",5),rep("No",5)),
  c(rep("Yes",5),rep("No",5))
)
names(Original_data_frame) <- c("id","Var1","Var2")
Using the mutate_at function of the dplyr package:
library(dplyr)
# funs() is deprecated in current dplyr; a formula lambda does the same job
Original_data_frame_mod <- Original_data_frame %>%
  mutate_at(.vars = vars(Var1, Var2), .funs = ~ ifelse(. == "Yes", 1, 0))
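With dplyr 1.0.0 or later, the same conversion can be written with across(), and the columns can be picked automatically instead of being named one by one. The following is a minimal sketch, assuming every target column holds exactly the values "Yes" and "No"; the helper is_yes_no and the sample data frame df are hypothetical names introduced for illustration.

```r
library(dplyr)

# Hypothetical sample data: one yes/no column plus a two-level column
# that should be left alone.
df <- data.frame(
  id     = 1:4,
  Var1   = c("Yes", "No", "Yes", "No"),
  Gender = c("F", "M", "F", "M"),
  stringsAsFactors = TRUE
)

# TRUE only when the column's distinct non-missing values are exactly
# "No" and "Yes" (hypothetical helper, case-sensitive).
is_yes_no <- function(x) {
  setequal(unique(as.character(x[!is.na(x)])), c("No", "Yes"))
}

# Convert every column that passes the check; all other columns are untouched.
df_mod <- df %>%
  mutate(across(where(is_yes_no), ~ as.integer(.x == "Yes")))
```

Because the check looks at the values and not only at the number of levels, a two-level column such as Gender is skipped.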
Here's how you could do it:
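yn_to_10 = function(x) {
  # Leave anything that is not a factor untouched
  if (! is.factor(x)) return(x)
  # Leave factors whose levels are not exactly "No" and "Yes" untouched
  # (levels are case-sensitive; adjust to match your data)
  if (! identical(levels(x), c("No", "Yes"))) return(x)
  return(ifelse(x == "Yes", 1, 0))
}
# Assigning into your_data[] keeps the data frame structure
your_data[] = lapply(your_data, yn_to_10)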
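If you would rather see which columns qualify before changing anything, one option is to build a logical index first and then convert only the indexed columns. This also covers the question about indexing columns. The following is a minimal base R sketch, assuming the yes/no columns are factors and that case should be ignored; the sample data and the names is_yes_no and yn_cols are hypothetical.

```r
# Hypothetical sample data standing in for your_data.
your_data <- data.frame(
  id           = 1:4,
  PhoneService = c("Yes", "No", "Yes", "Yes"),
  Partner      = c("no", "yes", "no", "no"),
  Contract     = c("Monthly", "Yearly", "Monthly", "Yearly"),
  stringsAsFactors = TRUE
)

# TRUE when a factor column has exactly the levels "no" and "yes", ignoring case.
is_yes_no <- function(x) {
  is.factor(x) && setequal(tolower(levels(x)), c("no", "yes"))
}

# Logical index with one entry per column.
yn_cols <- vapply(your_data, is_yes_no, logical(1))

# Convert only the indexed columns; everything else is left as it was.
your_data[yn_cols] <- lapply(
  your_data[yn_cols],
  function(x) as.integer(tolower(as.character(x)) == "yes")
)
```

Running names(your_data)[yn_cols] lists the columns that were selected, which is handy for a quick sanity check on a wide data set.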
But you should listen to the comments: factors are internally stored as integers (starting with 1, not 0), so changing a two-level factor to binary 0/1 doesn't really change very much.
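The point about internal storage can be checked directly. A small sketch, using a made-up vector f:

```r
# A made-up two-level factor; levels are sorted alphabetically,
# so "No" becomes level 1 and "Yes" becomes level 2.
f <- factor(c("Yes", "No", "Yes"))

# Underlying integer codes of the factor.
codes <- as.integer(f)

# Subtracting 1 shifts the 1/2 codes to 0/1.
binary <- codes - 1L
```

Here codes is 2 1 2 and binary is 1 0 1, so the 0/1 version is just the stored codes shifted by one.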