I'm taking an Advanced Business Analysis class for school and we're learning to program in R using RStudio.
The professor shared a hint to help us solve a problem, but I'm unable to get it to work.
I'm trying to replace any NA height values with the mean height for that gender.
Here's what the professor shared as a solution to the problem, but it doesn't work. Nothing gets updated in the data table:
data$height[is.na(data$height) && data$gender == "female"] = data$height[data$gender=="female"]
I tried this:
data$height[is.na(data$height) && data$gender == "female"] = mean(data$height[data$gender=="female"])
and this:
data$height[is.na(data$height) && data$gender == "female"] = mean(data$height[data$gender=="female"], na.rm = TRUE)
But got this warning:
In mean.default(data$height[data$gender == "female"]) : argument is not numeric or logical: returning NA
I calculated the mean height of each gender and tried it this way, but that didn't work either. In all scenarios, the height still displays “NA”.
femaleMeanHeight = mean(data$height[data$gender=="female"], na.rm = TRUE)
data$height[is.na(data$height) && data$gender == "female"] = femaleMeanHeight
I don't know where else to go. Any help is greatly appreciated.
There are two problems with your code. The first is in data$height[is.na(data$height) && data$gender == "female"] and the second is in mean(data$height[data$gender=="female"]).
Let's start with the second problem, since you already solved it. Calculating a mean over values that include NA returns NA. That's why you set na.rm = TRUE, so the NAs are ignored. (Replacing NA with NA makes no sense and no difference.)
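To see the effect of na.rm on its own, here is a minimal sketch. The vector heights and its values are made up for illustration and are not from your data:

```r
# Hypothetical heights with one missing value (invented for illustration)
heights <- c(160, NA, 170)

mean(heights)                # NA, because the missing value propagates
mean(heights, na.rm = TRUE)  # 165, computed from the two known values
```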
The first problem is the && part. There is a difference between & and &&. Just use & instead of && and your code might run.
data$height[is.na(data$height) & data$gender == "female"] = mean(data$height[data$gender=="female"], na.rm = TRUE)
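Here is the same fix as a small, self-contained sketch you can run. The data frame is invented for illustration (your real data will differ), and the helper names missing_female and female_mean are my own, not from your code:

```r
# Hypothetical data, invented for illustration
data <- data.frame(
  gender = c("male", "female", "female", "male", "female"),
  height = c(180, 160, NA, NA, 170),
  stringsAsFactors = FALSE
)

# Rows where height is missing AND gender is female (element-wise &)
missing_female <- is.na(data$height) & data$gender == "female"

# Mean of the known female heights: (160 + 170) / 2 = 165
female_mean <- mean(data$height[data$gender == "female"], na.rm = TRUE)

# Overwrite only the selected rows
data$height[missing_female] <- female_mean
data$height
# 180 160 165 NA 170 (the male NA in row 4 is left untouched)
```

The same two steps with "male" instead of "female" would fill in the remaining NA.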
As mentioned, && and & have different meanings.
& does exactly what you want. It tests, for every row, whether both of your conditions are true (is height NA and is gender female?). The result is a logical vector with one value per row, for example TRUE, FALSE, TRUE, FALSE (the first and third rows meet the conditions). The new mean height then overwrites height only in the rows with TRUE. --> That's what you want.
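A quick sketch of that element-wise behaviour, using two made-up vectors (height and gender here are standalone illustrations, not your columns):

```r
# Hypothetical values chosen so that rows 1 and 3 meet both conditions
height <- c(NA, 160, NA, 175)
gender <- c("female", "female", "female", "male")

is.na(height)                       # TRUE FALSE TRUE FALSE
gender == "female"                  # TRUE TRUE TRUE FALSE
is.na(height) & gender == "female"  # TRUE FALSE TRUE FALSE
```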
&& only tests the first row, so you get a single TRUE or FALSE. (This is the behavior of older R versions; since R 4.3.0, using && on vectors longer than one element is an error.) If your first row has NA in height and female in gender, you get TRUE, and the whole column is overwritten with the mean (data$height[TRUE] selects everything in the column height). If your first row is not female, or its height has a value, the result is FALSE, so no height is overwritten with the mean height.
So the likely reason nothing worked is that your first row didn't match your conditions, so the result was FALSE. Assigning the mean to data$height[FALSE] replaces NA with the mean height in no row at all.
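The sketch below imitates that first-row-only behaviour by taking element [1] of each condition explicitly, so it runs the same way on any R version. The vectors and the name first_only are made up for illustration:

```r
# Hypothetical values: the first row has a known height and is male
height <- c(170, NA, 160)
gender <- c("male", "female", "female")

# Roughly what is.na(height) && gender == "female" reduced to in older R:
# only the first element of each side is compared
first_only <- is.na(height)[1] && (gender == "female")[1]
first_only                 # FALSE, because row 1 has a height

height[first_only] <- 165  # indexing with a single FALSE selects nothing
height                     # 170 NA 160, unchanged

# A single TRUE, by contrast, is recycled and selects every element
height[TRUE] <- 165
height                     # 165 165 165
```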