Group instances based on NA values in R
I am reading a csv file and unfortunately my data frame has many missing values. A small snippet is as follows:
df <- data.frame(Size= c(800, 850, 1100, 1200, 1000),
Value= c(900, NA, 1300, 1100, NA),
Location= c(NA, 'midcity', 'uptown', NA, 'Lakeview'),
Num1 = c(2, NA, 3, 2, NA),
Num2 = c(2,3,3,1,2),
Rent= c('y', 'y', 'n', 'y', 'n'))
I want to predict some of the results using Weka, but I can't do it if multiple attributes are missing. I know that I should be using the function is.na, but I am not sure how, because so far I have used it only for summing and counting.
Edit: For example, in this file 4 of the 5 instances have missing values. Instances 2 and 5 share the same missing attributes (B and D), while instances 1 and 4 share the same missing value as well (C). What I'd like to get is a data frame that consists of those instances, so I can export them into files and run the analysis on each file individually. An example of the output could be:
> A
> B
Edit 2:
I want to save the splits and so far I tried this:
write.csv(split(temp, index), file = "C:/Users/Nikita/Desktop/splits.csv", row.names=FALSE)
But it writes all the splits in one line. Is there a way to separate them by a line?
Edit 3:
My steps are:
data <- read.csv("location")
index <- apply(is.na(data)*1, 1,paste, collapse = "")
s <- split(data, index)
lapply(s, function(x) {names(x) <- names(data);x})
big.data <- do.call(rbind, s)
write.csv(big.data, file = "location", row.names=FALSE)
Am I missing something?
df[!is.na(df$Value), ]
Size Value Location Num1 Num2 Rent
1 800 900 <NA> 2 2 y
3 1100 1300 uptown 3 3 n
4 1200 1100 <NA> 2 1 y
And
df[is.na(df$Value), ]
Size Value Location Num1 Num2 Rent
2 850 NA midcity NA 3 y
5 1000 NA Lakeview NA 2 n
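The same two subsets can also be produced in a single call to split. This is only a minimal sketch, assuming the df from the question; the names by_value, has_value and missing_value are illustrative and not part of the original answer.

```r
# Example data from the question
df <- data.frame(Size = c(800, 850, 1100, 1200, 1000),
                 Value = c(900, NA, 1300, 1100, NA),
                 Location = c(NA, "midcity", "uptown", NA, "Lakeview"),
                 Num1 = c(2, NA, 3, 2, NA),
                 Num2 = c(2, 3, 3, 1, 2),
                 Rent = c("y", "y", "n", "y", "n"))

# split() groups the rows by the logical vector;
# the resulting list elements are named "FALSE" and "TRUE"
by_value <- split(df, is.na(df$Value))

has_value <- by_value[["FALSE"]]      # rows where Value is present
missing_value <- by_value[["TRUE"]]   # rows where Value is NA
```

This only looks at one column, so it does not distinguish rows that differ in which other attributes are missing.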
In the future, please create a reproducible example so that users do not have to create a data frame by hand from your question. Pictures are not as helpful.
df <- data.frame(Size= c(800, 850, 1100, 1200, 1000),
Value= c(900, NA, 1300, 1100, NA),
Location= c(NA, 'midcity', 'uptown', NA, 'Lakeview'),
Num1 = c(2, NA, 3, 2, NA),
Num2 = c(2,3,3,1,2),
Rent= c('y', 'y', 'n', 'y', 'n'))
To combine it all, use lapply, since split creates a list:
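s <- split(temp, index)  # one data frame per NA pattern
lapply(names(s), function(n) write.csv(s[[n]], file = paste0("C:/Users/Nikita/Desktop/", n, "splits.csv"), row.names = FALSE))  # one file per split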
With a for loop:
s <- split(temp, index)
for (i in seq_along(s)) {
  # s[[i]] extracts the data frame itself; s[i] would be a one-element list
  write.csv(s[[i]], file = paste0("C:/Users/Nikita/Desktop/", i, "splits.csv"), row.names = FALSE)
}
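For reference, here is a self-contained variant of the loop that can be run as-is. It is a sketch under a few assumptions: it uses the df from the question instead of temp, writes to tempdir() rather than the Desktop path, and names each file after its NA pattern. The names out_dir, out_file and written, and the "split_" file-name scheme, are illustrative choices.

```r
# Example data from the question
df <- data.frame(Size = c(800, 850, 1100, 1200, 1000),
                 Value = c(900, NA, 1300, 1100, NA),
                 Location = c(NA, "midcity", "uptown", NA, "Lakeview"),
                 Num1 = c(2, NA, 3, 2, NA),
                 Num2 = c(2, 3, 3, 1, 2),
                 Rent = c("y", "y", "n", "y", "n"))

# One 0/1 string per row: 1 where the column is NA, 0 otherwise
index <- apply(is.na(df) * 1, 1, paste, collapse = "")
s <- split(df, index)

out_dir <- tempdir()  # assumption: replace with your own folder
for (pattern in names(s)) {
  # Each split goes to its own file, e.g. split_010100.csv
  out_file <- file.path(out_dir, paste0("split_", pattern, ".csv"))
  write.csv(s[[pattern]], file = out_file, row.names = FALSE)
}

# List the files that were just written
written <- list.files(out_dir, pattern = "^split_[01]+\\.csv$")
```

Naming files by pattern rather than by position keeps the file names stable if the data later gains or loses a pattern.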
Recreating your example data:
df <- data.frame(Size= c(800, 850, 1100, 1200, 1000),
Value= c(900, NA, 1300, 1100, NA),
Location= c(NA, 'midcity', 'uptown', NA, 'Lakeview'),
Num1 = c(2, NA, 3, 2, NA),
Num2 = c(2,3,3,1,2),
Rent= c('y', 'y', 'n', 'y', 'n'))
Now, splitting your data according to the pattern of NA as you want:
# This generates an index with 1 for a column with NA and 0 otherwise
index <- apply(is.na(df)*1, 1,paste, collapse = "")
# This splits the data.frame according to the index
split(df, index)
$`000000`
Size Value Location Num1 Num2 Rent
3 1100 1300 uptown 3 3 n
$`001000`
Size Value Location Num1 Num2 Rent
1 800 900 <NA> 2 2 y
4 1200 1100 <NA> 2 1 y
$`010100`
Size Value Location Num1 Num2 Rent
2 850 NA midcity NA 3 y
5 1000 NA Lakeview NA 2 n
Notice that the first element "000000" comprises all the complete observations. Then "001000" comprises all observations where only column 3 (Location) is missing. And so on.
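If the 0/1 strings are hard to read, the groups can be labelled with the names of the missing columns instead. This is a minimal sketch, assuming the same df; the names na_mat, na_label and groups, as well as the "complete" label and the "+" separator, are illustrative choices and not part of the original answer.

```r
# Example data from the question
df <- data.frame(Size = c(800, 850, 1100, 1200, 1000),
                 Value = c(900, NA, 1300, 1100, NA),
                 Location = c(NA, "midcity", "uptown", NA, "Lakeview"),
                 Num1 = c(2, NA, 3, 2, NA),
                 Num2 = c(2, 3, 3, 1, 2),
                 Rent = c("y", "y", "n", "y", "n"))

# Logical matrix: TRUE where a cell is NA
na_mat <- is.na(df)

# For each row, join the names of the columns that are NA;
# rows without any NA get the label "complete"
na_label <- apply(na_mat, 1, function(r) {
  if (any(r)) paste(colnames(na_mat)[r], collapse = "+") else "complete"
})

# Same grouping as the 0/1 index, but with readable names
groups <- split(df, na_label)
names(groups)
```

The grouping is identical to the one produced by the 0/1 index; only the list names change.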