How to replace values of a column based on a condition and random sampling?

I'm working on a Gender column that is a factor with the levels 'Male', 'Female' and 'Total'. The 'Total' level is not needed, so I've decided to replace half of the 'Total' values with males and assign the rest to females. The column is simple, and I've converted the factor to numeric codes with the basic as.numeric(factor()) line:

Gender     NewGender
Male       1
Female     2
Total      3
Total      3
.
.
Female     2

Now the next step is to replace all the 3s with 1s and 2s, but in a random order.

There are 55,399 observations in total, of which 22,057 are threes in the NewGender column. I have tried several sets of commands, of which the closest, I think, is:

# Experiment with 50 rows

for (row in data$NewGender[sample(which(data$NewGender, 50), ]) {
        if (row == 3) {row <- 1; row <- row + 1}
}

This generates warnings though and doesn't seem to be replacing the threes. I could well use this:

data$NewGender[data$NewGender == 3] <- 1

But I'm unable to combine it with sample(). What I want is NewGender containing only ones and twos, with half of the threes replaced by ones and the other half by twos, fully randomised. Any good suggestions? Thanks in advance.

I would say that the easiest way is to use sample and ifelse. You should also probably sample based on the observed distribution of males and females.

# Some data
gender <- sample(c("male", "female", "other"), 100, prob = c(0.4, 0.3, 0.3), replace = TRUE)

# Calculating proportion of females vs males
male_prop <- sum(gender=="male")/(sum(gender=="male")+sum(gender=="female"))
female_prop <- sum(gender=="female")/(sum(gender=="male")+sum(gender=="female"))

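# Replacing other at random: draw one candidate per element, because ifelse()
# would recycle a single draw and give every "other" the same value
gender <- ifelse(gender=="other", sample(c("male", "female"), length(gender), prob = c(male_prop, female_prop), replace = TRUE), gender)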

Note: As in markus's answer, it is a good idea to set a seed to ensure reproducibility.
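
As a rough sketch of the same idea on the numeric coding from the question, the 3s can be overwritten directly with subset assignment, drawing one value per affected row. The toy data frame, the seed value and the names is_total and p_male are assumptions made for illustration only; they are not taken from the asker's data.

```r
set.seed(42)  # any fixed seed makes the random replacement reproducible

# Hypothetical toy data standing in for the asker's data frame
data <- data.frame(
  NewGender = sample(1:3, 100, replace = TRUE, prob = c(0.4, 0.3, 0.3))
)

# Rows currently coded as 'Total'
is_total <- data$NewGender == 3

# Observed share of 1s among the rows that are already 1 or 2
p_male <- mean(data$NewGender[!is_total] == 1)

# One independent draw per row coded 3, weighted by the observed shares
data$NewGender[is_total] <- sample(1:2, sum(is_total), replace = TRUE,
                                   prob = c(p_male, 1 - p_male))

# Only 1s and 2s should remain
table(data$NewGender)
```

Passing sum(is_total) as the sample size is what gives every replaced row its own draw; for an even split instead, the prob argument can simply be dropped.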

You can use replace and sample.

Given a vector containing numbers from 1 to 3:

set.seed(1)
NewGender <- sample(1:3, 20, TRUE)
table(NewGender)
#NewGender
#1 2 3 
#5 7 8 

We create a logical vector that is TRUE where NewGender equals 3.

idx <- NewGender == 3

Now we replace the 3's with a random sample of 1's and 2's.

out <- replace(NewGender, idx, sample(1:2, sum(idx), TRUE))

Check the distribution

table(out)
#out
# 1  2 
#11  9 
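
Because each replacement above is drawn independently, the 3's are split roughly, not exactly, in half. If an exact half-and-half split is wanted, as the question describes, one possible variant is to build a balanced vector of 1's and 2's first and then shuffle it. The names n, fill and out_exact below are illustrative assumptions, and with an odd number of 3's one group necessarily gets one extra value.

```r
set.seed(1)
NewGender <- sample(1:3, 20, TRUE)

# Positions that hold a 3
idx <- NewGender == 3
n <- sum(idx)

# Balanced vector 1, 2, 1, 2, ... of length n, then shuffled;
# sample() without a size argument returns a random permutation
fill <- sample(rep(1:2, length.out = n))

# Put the shuffled values into the positions that held a 3
out_exact <- replace(NewGender, idx, fill)

# Counts of 1's and 2's among the replaced positions differ by at most one
table(out_exact[idx])
```

The existing 1's and 2's are left untouched; only the positions flagged by idx change.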
