Random sample with probability proportional to size

Question

I am working on a statistics project. I have a table of words and the frequency of each word in a text, and I want to draw a sample that returns the words with the highest frequency.

Hello, good afternoon. I hope someone can help me.

I have a table with words and how often each one appears in a text.

word <- c("banana", "watermelon", "water", "apple", "blue", "sky")
freq <- c("10", "2", "5", "8", "2", "1")

table <- cbind.data.frame(word, freq)
#        word    freq
# 1     banana   10
# 2 watermelon    2
# 3      water    5
# 4      apple    8
# 5       blue    2
# 6        sky    1

sample(table$freq,2)
# [1] 2 5

What I want is:

# [1] 10 8

Answer 1

If you want to sample with probabilities weighted by your freq (converted to integer), then perhaps

sample(tb$freq, size = 2, prob = tb$freq)
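Note that the call above returns frequencies, not words, so it cannot tell apart the two rows that share freq 2. One possible variation, sketched here as an assumption rather than a requirement, is to sample row numbers and then index the table, so each drawn word stays attached to its frequency. The names idx and picked and the seed value are illustrative only.

```r
# Same data as in the Data section, rebuilt so this snippet runs on its own
word <- c("banana", "watermelon", "water", "apple", "blue", "sky")
freq <- as.integer(c("10", "2", "5", "8", "2", "1"))
tb <- data.frame(freq, word, stringsAsFactors = FALSE)

set.seed(1)  # arbitrary seed, only so the draw is repeatable

# Draw two distinct row numbers, weighted by freq.
# The weights do not need to sum to one; sample() rescales them.
idx <- sample(nrow(tb), size = 2, prob = tb$freq)

# Keep whole rows, so word and freq stay together
picked <- tb[idx, ]
picked
```

Because the draw is random, the rows in picked change with the seed; high-frequency words are favored but never guaranteed.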

Let's see how strongly this favors the words we expect to get. For demonstration, I'll sample the words based on their freq (since that makes more sense to me); you can move variables around as you see fit.

samps <- replicate(1000, sample(tb$word, size = 2, prob = tb$freq))
str(samps)
#  chr [1:2, 1:1000] "water" "apple" "water" "banana" "watermelon" "banana" ...
sort(table(samps))
# samps
#        sky watermelon       blue      water      apple     banana 
#         93        151        166        370        572        648

The replicate call gives us a matrix, so after tabulating and sorting the counts, we see that banana is more likely than all the others.

We can see that the proportions are about right (not exact, since the two words in each draw are sampled without replacement) with

sort(table(samps)) / sum(table(samps))
# samps
#        sky watermelon       blue      water      apple     banana 
#     0.0465     0.0755     0.0830     0.1850     0.2860     0.3240 
tb$pct <- tb$freq / sum(tb$freq)
tb <- tb[ order(tb$pct), ]
tb
#   freq       word        pct
# 6    1        sky 0.03571429
# 2    2 watermelon 0.07142857
# 5    2       blue 0.07142857
# 3    5      water 0.17857143
# 4    8      apple 0.28571429
# 1   10     banana 0.35714286
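Weighted sampling only makes the frequent words more likely; it does not guarantee the 10 8 result shown in the question. If the goal is really "the n most frequent words" every time, a deterministic alternative (not a random sample at all) is to sort by freq and keep the first rows. This is a minimal sketch under that assumption; n and top_n are illustrative names.

```r
# Same data as in the Data section, rebuilt so this snippet runs on its own
word <- c("banana", "watermelon", "water", "apple", "blue", "sky")
freq <- as.integer(c("10", "2", "5", "8", "2", "1"))
tb <- data.frame(freq, word, stringsAsFactors = FALSE)

n <- 2  # how many of the most frequent words to keep

# Sort rows from highest to lowest freq, then keep the first n
top_n <- tb[order(tb$freq, decreasing = TRUE), ][seq_len(n), ]

top_n$freq
# [1] 10  8
top_n$word
# [1] "banana" "apple"
```

If several words tie at the cutoff, this keeps whichever of them sorts first, so tied data may need an explicit tie-breaking rule.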

Data

word <- c("banana", "watermelon", "water", "apple", "blue", "sky")
freq <- as.integer(c("10", "2", "5", "8", "2", "1"))
tb <- data.frame(freq, word)
