
How can I replace NA values for a specific condition in RStudio?

I'm taking an Advanced Business Analysis class for school and we're learning to program in RStudio.

The professor shared a hint to help us solve a problem, but I'm unable to get it to work.

I'm trying to replace any NA height values with the mean height for that row's gender.

Here's what the professor shared as a solution to the problem, but it doesn't work. Nothing gets updated in the data table:

data$height[is.na(data$height) && data$gender == "female"] = data$height[data$gender=="female"]

I tried this:

data$height[is.na(data$height) && data$gender == "female"] = mean(data$height[data$gender=="female"])

and this:

data$height[is.na(data$height) && data$gender == "female"] = mean(data$height[data$gender=="female"], na.rm = TRUE)

But got this error:

In mean.default(data$height[data$gender == "female"]) :  argument is not numeric or logical: returning NA

I calculated the mean height of each gender and tried it this way, but that didn't work either. In all scenarios, the height still displays “NA”.

femaleMeanHeight = mean(data$height[data$gender=="female"], na.rm = TRUE)
data$height[is.na(data$height) && data$gender == "female"] = femaleMeanHeight

I don't know where else to go. Any help is greatly appreciated.

There are two problems with your code. The first is in data$height[is.na(data$height) && data$gender == "female"] and the second is in mean(data$height[data$gender=="female"]).

Let's start with the second problem, which you have already solved. Calculating a mean over values that include NA returns NA, so you set na.rm = TRUE to have the NAs ignored. (Replacing NA with NA makes no difference.)
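A small sketch of what na.rm changes, using made-up vectors (the names h and h_chr and their values are hypothetical, not from your data). It also covers the warning you quoted: R raises "argument is not numeric or logical" when the vector passed to mean() is not numeric, for example when heights were read in as text, so it may be worth checking class(data$height) as well:

```r
# Hypothetical heights; not taken from the question's data
h <- c(160, NA, 170)

mean(h)                # NA, because one value is missing
mean(h, na.rm = TRUE)  # 165, missing values are dropped first

# Assumption for illustration: the column was read in as text
h_chr <- c("160", NA, "170")
class(h_chr)                           # "character"
mean(as.numeric(h_chr), na.rm = TRUE)  # 165 after converting to numeric
```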

The first problem is the && part. & and && are different operators. Use & instead of && and your code should run:

data$height[is.na(data$height) & data$gender == "female"] = mean(data$height[data$gender=="female"], na.rm = TRUE)
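To try the corrected line on something small, here is a minimal sketch with a made-up data frame. The column names follow the question; the rows and the helper variable rows are hypothetical:

```r
# Hypothetical data; column names follow the question
data <- data.frame(
  gender = c("male", "female", "female", "female", "male"),
  height = c(180, 160, NA, 170, NA)
)

# One logical per row: TRUE where height is missing AND gender is female
rows <- is.na(data$height) & data$gender == "female"

# Replace only those rows with the mean of the known female heights
data$height[rows] <- mean(data$height[data$gender == "female"], na.rm = TRUE)

data$height  # 180 160 165 170 NA
```

The male NA in the last row is left alone. Repeating the same two steps with "male" would fill it in.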

As mentioned, && and & have different meanings.

& does exactly what you want. It tests, for every row, whether both conditions are true (is height NA, and is gender female?). The result is a logical vector with one value per row, for example TRUE, FALSE, TRUE, FALSE (the first and third rows meet the conditions). The mean height then overwrites the height only in the rows marked TRUE.
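The row-by-row behaviour of & can be seen on two short made-up vectors (height and gender here are hypothetical stand-ins for the columns):

```r
# Hypothetical columns for four rows
height <- c(NA, 165, NA, 172)
gender <- c("female", "female", "female", "male")

is.na(height)                       # TRUE FALSE  TRUE FALSE
gender == "female"                  # TRUE  TRUE  TRUE FALSE
is.na(height) & gender == "female"  # TRUE FALSE  TRUE FALSE
```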

&& is meant for single values and does not work row by row. In R versions before 4.3.0 it looks only at the first element of each side, so here it tests only the first row and returns a single TRUE or FALSE (R 4.3.0 and later raise an error instead when either side is longer than one). If your first row has NA in height and female in gender, you get TRUE, and every row is overwritten with the mean, because data$height[TRUE] selects the whole height column. If your first row is not female or its height has a value, the result is FALSE, so no height is overwritten.

So the likely reason nothing changed is that your first row didn't meet both conditions, which made the result FALSE. Assigning the mean to data$height[FALSE] replaces the height in no rows at all.
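What a single TRUE or FALSE does as an index can be checked directly. This sketch uses a made-up vector x and an arbitrary replacement value of 168, and it deliberately avoids calling && on vectors, since that behaves differently across R versions:

```r
x <- c(NA, 165, NA, 172)  # hypothetical heights

all_rows <- x
all_rows[TRUE] <- 168     # a single TRUE is recycled: every element is replaced
all_rows                  # 168 168 168 168

no_rows <- x
no_rows[FALSE] <- 168     # a single FALSE selects nothing
no_rows                   # NA 165 NA 172
```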
