I am reading a CSV file, and unfortunately my data frame has many missing values. A small sample looks like this:
df <- data.frame(Size= c(800, 850, 1100, 1200, 1000),
Value= c(900, NA, 1300, 1100, NA),
Location= c(NA, 'midcity', 'uptown', NA, 'Lakeview'),
Num1 = c(2, NA, 3, 2, NA),
Num2 = c(2,3,3,1,2),
Rent= c('y', 'y', 'n', 'y', 'n'))
I want to predict some of the results using Weka, but I can't do that if multiple attributes are missing. I know that I should be using the function is.na, but I am not sure how, because so far I have only used it for summing and counting.
Edit: For example, in this file 4 of the 5 instances have missing values. Instances 2 and 5 share the same missing attributes (B and D, i.e. Value and Num1), while instances 1 and 4 share the same missing attribute (C, i.e. Location). What I'd like to get is one data frame per group of instances, so I can export each group to its own file and run the analysis on those files individually. An example of an output could be:
> A
> B
Edit 2:
I want to save the splits and so far I tried this:
write.csv(split(temp, index), file = "C:/Users/Nikita/Desktop/splits.csv", row.names=FALSE)
But it writes all the splits in one line. Is there a way to separate them by a line?
Edit 3:
My steps are:
data <- read.csv("location")
index <- apply(is.na(data)*1, 1,paste, collapse = "")
s <- split(data, index)
lapply(s, function(x) {names(x) <- names(data);x})
big.data <- do.call(rbind, s)
write.csv(big.data, file = "location", row.names=FALSE)
Am I missing something?
df[!is.na(df$Value), ]
Size Value Location Num1 Num2 Rent
1 800 900 <NA> 2 2 y
3 1100 1300 uptown 3 3 n
4 1200 1100 <NA> 2 1 y
And
df[is.na(df$Value), ]
Size Value Location Num1 Num2 Rent
2 850 NA midcity NA 3 y
5 1000 NA Lakeview NA 2 n
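The same kind of filter can be applied to all columns at once. As a minimal sketch, assuming the example data frame from the question, base R's complete.cases() flags rows with no missing values in any column (the names complete_rows and incomplete_rows are made up for illustration):

```r
# Example data from the question
df <- data.frame(Size = c(800, 850, 1100, 1200, 1000),
                 Value = c(900, NA, 1300, 1100, NA),
                 Location = c(NA, 'midcity', 'uptown', NA, 'Lakeview'),
                 Num1 = c(2, NA, 3, 2, NA),
                 Num2 = c(2, 3, 3, 1, 2),
                 Rent = c('y', 'y', 'n', 'y', 'n'))

# TRUE for rows that have no NA in any column
ok <- complete.cases(df)

complete_rows   <- df[ok, ]   # rows without any missing value
incomplete_rows <- df[!ok, ]  # rows with at least one missing value
```

This only separates complete from incomplete rows; it does not group rows by which columns are missing.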
In the future, please create a reproducible example so that users do not have to create a data frame by hand from your question. Pictures are not as helpful.
df <- data.frame(Size= c(800, 850, 1100, 1200, 1000),
Value= c(900, NA, 1300, 1100, NA),
Location= c(NA, 'midcity', 'uptown', NA, 'Lakeview'),
Num1 = c(2, NA, 3, 2, NA),
Num2 = c(2,3,3,1,2),
Rent= c('y', 'y', 'n', 'y', 'n'))
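To combine it all, use lapply, since split creates a list. Give each piece its own file name; with a single fixed file name, every call would overwrite the previous one:
s <- split(temp, index)
invisible(lapply(seq_along(s), function(i) write.csv(s[[i]], file = paste0("C:/Users/Nikita/Desktop/", i, "splits.csv"), row.names=FALSE)))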
With a for loop:
s <- split(temp, index)
for (i in seq_along(s)) {
  # s[[i]] extracts the data frame itself; s[i] would be a one-element list
  write.csv(s[[i]], file = paste0("C:/Users/Nikita/Desktop/", i, "splits.csv"), row.names=FALSE)
}
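As a self-contained sketch of the same idea, the version below rebuilds the example data, splits it by NA pattern, and writes one file per pattern. The output directory (tempdir()) and the "split_<pattern>.csv" naming scheme are assumptions for illustration, not something from the question:

```r
# Example data from the question
df <- data.frame(Size = c(800, 850, 1100, 1200, 1000),
                 Value = c(900, NA, 1300, 1100, NA),
                 Location = c(NA, 'midcity', 'uptown', NA, 'Lakeview'),
                 Num1 = c(2, NA, 3, 2, NA),
                 Num2 = c(2, 3, 3, 1, 2),
                 Rent = c('y', 'y', 'n', 'y', 'n'))

# One string per row: 1 where the column is NA, 0 otherwise
index <- apply(is.na(df) * 1, 1, paste, collapse = "")
s <- split(df, index)

# Assumed output location and file names; change to suit
out_dir <- tempdir()
paths <- file.path(out_dir, paste0("split_", names(s), ".csv"))

# Write each piece to its own file
for (i in seq_along(s)) {
  write.csv(s[[i]], file = paths[i], row.names = FALSE)
}
```

Naming the files after the pattern rather than a running number makes it easier to tell afterwards which columns are missing in each file.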
Recreating your example data:
df <- data.frame(Size= c(800, 850, 1100, 1200, 1000),
Value= c(900, NA, 1300, 1100, NA),
Location= c(NA, 'midcity', 'uptown', NA, 'Lakeview'),
Num1 = c(2, NA, 3, 2, NA),
Num2 = c(2,3,3,1,2),
Rent= c('y', 'y', 'n', 'y', 'n'))
Now, splitting your data according to the pattern of NA as you want:
# This generates an index with 1 for a column with NA and 0 otherwise
index <- apply(is.na(df)*1, 1,paste, collapse = "")
# This splits the data.frame according to the index
split(df, index)
$`000000`
Size Value Location Num1 Num2 Rent
3 1100 1300 uptown 3 3 n
$`001000`
Size Value Location Num1 Num2 Rent
1 800 900 <NA> 2 2 y
4 1200 1100 <NA> 2 1 y
$`010100`
Size Value Location Num1 Num2 Rent
2 850 NA midcity NA 3 y
5 1000 NA Lakeview NA 2 n
Notice that the first element, "000000", contains all the complete cases. Then "001000" contains all observations where column 3 (Location) is missing, "010100" contains those where columns 2 and 4 (Value and Num1) are missing, and so on.
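The pattern strings can also be translated back into column names. This is a small sketch, assuming the df and index construction shown above; missing_cols is a made-up name:

```r
# Example data from the question
df <- data.frame(Size = c(800, 850, 1100, 1200, 1000),
                 Value = c(900, NA, 1300, 1100, NA),
                 Location = c(NA, 'midcity', 'uptown', NA, 'Lakeview'),
                 Num1 = c(2, NA, 3, 2, NA),
                 Num2 = c(2, 3, 3, 1, 2),
                 Rent = c('y', 'y', 'n', 'y', 'n'))

index <- apply(is.na(df) * 1, 1, paste, collapse = "")
s <- split(df, index)

# For each pattern, list the names of the columns flagged with "1"
missing_cols <- sapply(names(s), function(p) {
  flags <- strsplit(p, "")[[1]] == "1"
  paste(names(df)[flags], collapse = ", ")
})
```

Here missing_cols is a named character vector, so missing_cols[["010100"]] gives the missing columns for that group, and the entry for "000000" is an empty string.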