
Sample without replacement, or duplicates, in R

I have a long list that contains quite a few duplicates: say, 100,000 values, 20% of which are duplicates. I want to randomly sample from this list, placing all of the values into groups, say 400 of them. However, I don't want any of the resulting groups to contain duplicate values, i.e. I want all 250 members of each group to be unique.

I've tried using various permutation methods from vegan, picante, EcoSimR, but they don't do quite what I want, or seem to struggle with the large amount of data.

I wondered if there was just some way of using the sample function that I can't figure out? Any help or alternative suggestions would be much appreciated...

As nico noted, you probably just need the unique function. The very simple sampling program below guarantees that no value is duplicated across the groups (which isn't entirely sensible, because you could just create one big sample instead...)

# Getting some random values to use here
set.seed(seed = 14412)
thevalues <- sample(x = 1:100,size = 1000,replace = TRUE)

# Obtaining the unique vector of those values
thevalues.unique <- unique(thevalues)

# Create a sample without replacement (i.e. take the ball out and don't put it back in)
sample1 <- sample(x = thevalues.unique,size = 10,replace = FALSE)

# Remove the sampled items from the vector of values
thevalues.unique <- thevalues.unique[!(thevalues.unique %in% sample1)]

# Another sample, and another removal
sample2 <- sample(x = thevalues.unique,size = 10,replace = FALSE)
thevalues.unique <- thevalues.unique[!(thevalues.unique %in% sample2)]
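The "one big sample" remark can be sketched as follows: draw once without replacement from the unique values, then cut the result into groups with split. This is only a minimal sketch; n.groups, group.size, big.sample and groups are names and sizes made up for the illustration, not something from the question.

```r
# Same simulated data as above
set.seed(14412)
thevalues <- sample(x = 1:100, size = 1000, replace = TRUE)
thevalues.unique <- unique(thevalues)

# Hypothetical sizes, chosen only for this illustration
n.groups <- 5
group.size <- 10

# There must be enough distinct values to fill every group
stopifnot(length(thevalues.unique) >= n.groups * group.size)

# One draw without replacement, then cut into consecutive chunks
big.sample <- sample(x = thevalues.unique,
                     size = n.groups * group.size,
                     replace = FALSE)
groups <- split(big.sample, rep(seq_len(n.groups), each = group.size))
```

groups is then a list with one vector per group, and since everything came from a single draw without replacement, no value can appear twice, either within a group or across groups.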

To do what eipi10 mentioned and weight the sample by how often each value occurs, you first need the relative frequency of each value. One way of doing this:

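# Simulate the data again (the equal weights are just spelled out explicitly)
set.seed(seed = 14412)
thevalues <- sample(x = 1:100,size = 1000,replace = TRUE,prob = rep(0.01,100))

# Unique values in ascending order
thevalues.unique <- sort(unique(thevalues))

# Relative frequency of each value; table() also orders its counts by value,
# so the i-th frequency belongs to the i-th element of thevalues.unique
thevalues.probs <- table(thevalues)/length(thevalues)

# Weighted sample without replacement
sample1 <- sample(x = thevalues.unique,
                  size = 10,
                  replace = FALSE,
                  prob = thevalues.probs)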
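If, as the question describes, every value (repeats included) has to end up in some group and only duplicates within a group are forbidden, unique throws away too much. One possible alternative, which is my own assumption rather than anything suggested in the comments, is to line the values up so that copies of the same value are adjacent and then deal group ids round-robin. The data and the names n.groups, value.rank, deal.order and group.id below are hypothetical, and the sizes are scaled down from the 100,000 values and 400 groups in the question.

```r
set.seed(14412)

# Hypothetical stand-in for the real list: 1,000 values with repeats
thevalues <- sample(x = 1:400, size = 1000, replace = TRUE)
n.groups <- 20   # hypothetical; the question would use 400

# This only works if no value occurs more often than there are groups
stopifnot(max(table(thevalues)) <= n.groups)

# Give each distinct value a random rank, so that ordering by rank
# puts all copies of a value next to each other in a random value order
value.rank <- match(thevalues, sample(unique(thevalues)))
deal.order <- order(value.rank)

# Deal group ids 1, 2, ..., n.groups, 1, 2, ... along that order;
# adjacent copies of a value therefore land in different groups
group.id <- integer(length(thevalues))
group.id[deal.order] <- rep_len(seq_len(n.groups), length(thevalues))

groups <- split(thevalues, group.id)
```

The groups come out as evenly sized as the data allows. Note that this is a randomised dealing scheme, not a uniform draw from all valid groupings, so if that distinction matters for your analysis it needs a closer look.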

