Random sample with probability proportional to size

I am working on a statistics project. I have a table of words and the frequency of each word in a text, and I want a sample that returns the words with the highest frequency.

Hello, good afternoon. I hope someone can help me.

I have a table with words and how often each one appears in a text.

word <- c("banana", "watermelon", "water", "apple", "blue", "sky")
freq <- c("10", "2", "5", "8", "2", "1")

table <- cbind.data.frame(word, freq)
#        word    freq
# 1     banana   10
# 2 watermelon    2
# 3      water    5
# 4      apple    8
# 5       blue    2
# 6        sky    1

sample(table$freq,2)
# [1] 2 5

What I want is:

# [1] 10 8

If you want to sample with probabilities weighted by your freq (converted to integer, see Data at the end), then perhaps

sample(tb$freq, size = 2, prob = tb$freq)

Let's check how strongly this favors the words we think we should be getting. For demonstration, I'll sample word rather than freq, weighted by freq (since that makes more sense to me); you can swap the variables as you see fit.

samps <- replicate(1000, sample(tb$word, size = 2, prob = tb$freq))
str(samps)
#  chr [1:2, 1:1000] "water" "apple" "water" "banana" "watermelon" "banana" ...
sort(table(samps))
# samps
#        sky watermelon       blue      water      apple     banana 
#         93        151        166        370        572        648 

The replicate call gives us a 2 x 1000 character matrix, so after tabulating and sorting the counts we see that banana is drawn more often than any other word.

Comparing each word's share of the draws with its share of the total frequency shows that the proportions are about right:

sort(table(samps)) / sum(table(samps))
# samps
#        sky watermelon       blue      water      apple     banana 
#     0.0465     0.0755     0.0830     0.1850     0.2860     0.3240 
tb$pct <- tb$freq / sum(tb$freq)
tb <- tb[ order(tb$pct), ]
tb
#   freq       word        pct
# 6    1        sky 0.03571429
# 2    2 watermelon 0.07142857
# 5    2       blue 0.07142857
# 3    5      water 0.17857143
# 4    8      apple 0.28571429
# 1   10     banana 0.35714286
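
The simulated shares above are close to pct but not identical (banana is about 0.324 versus 0.357). One likely reason is that sample() without replacement draws one item at a time and renormalizes the weights over the items that remain, so the second draw is not proportional to the original freq. The sketch below works out the expected share for size = 2 under that assumption; the names p, p_second, incl and share are mine, not part of the answer above.

```r
word <- c("banana", "watermelon", "water", "apple", "blue", "sky")
freq <- as.integer(c("10", "2", "5", "8", "2", "1"))
p <- freq / sum(freq)  # probability that each word is the first draw

# Probability that word i is the second draw: some other word j comes first
# (probability p[j]), then i is chosen from the remaining weight 1 - p[j].
p_second <- sapply(seq_along(p), function(i) sum(p[-i] * p[i] / (1 - p[-i])))

# Probability that each word appears somewhere in a sample of size 2
incl <- setNames(p + p_second, word)

# Expected share of all drawn words (each sample contributes two words),
# the quantity that sort(table(samps)) / sum(table(samps)) estimates
share <- sort(incl / 2)
round(share, 4)
```

For banana this gives roughly 0.323, which sits closer to the simulated value than pct does. If you need each draw to follow pct exactly, sampling with replace = TRUE makes the draws independent, at the cost of allowing the same word twice.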

Data

word <- c("banana", "watermelon", "water", "apple", "blue", "sky")
freq <- as.integer(c("10", "2", "5", "8", "2", "1"))
tb <- data.frame(freq, word)
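
Weighted sampling only makes the high-frequency words more likely; it will not return 10 and 8 every time. If what you actually need is always the two most frequent words, that is a sort rather than a sample. A minimal sketch, assuming the same tb as in Data (idx and top2 are names I made up for illustration):

```r
word <- c("banana", "watermelon", "water", "apple", "blue", "sky")
freq <- as.integer(c("10", "2", "5", "8", "2", "1"))
tb <- data.frame(freq, word)

# Row indices ordered from highest to lowest freq, keeping the first two
idx <- head(order(tb$freq, decreasing = TRUE), 2)
top2 <- tb[idx, ]

top2$freq  # the two largest frequencies
top2$word  # the words they belong to
```

With this data the result is the frequencies 10 and 8, for banana and apple. Ties are returned in their original row order, so decide how you want to break them if the cutoff falls on tied frequencies.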


 