I am working on a statistics project. I have a table of words and the frequency of each one in a text, and I want a sample that returns the words with the highest frequency.
Hello, good afternoon. I hope someone can help me.
I have a table with words and how often each one appears in a text.
word <- c("banana", "watermelon", "water", "apple", "blue", "sky")
freq <- c("10", "2", "5", "8", "2", "1")
table <- cbind.data.frame(word, freq)
#         word freq
# 1     banana   10
# 2 watermelon    2
# 3      water    5
# 4      apple    8
# 5       blue    2
# 6        sky    1
sample(table$freq,2)
# [1] 2 5
What I want is:
# [1] 10 8
If you want the words sampled with probability weighted by their freq (converted to integer), then perhaps:
sample(tb$freq, size = 2, prob = tb$freq)
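The question stores freq as character strings, while the call above assumes numeric weights. A minimal sketch of the conversion, assuming the column holds only digit strings (the names tb_chr and draw are hypothetical, used only for this illustration):

```r
# The question's data, with freq stored as character strings.
word <- c("banana", "watermelon", "water", "apple", "blue", "sky")
freq <- c("10", "2", "5", "8", "2", "1")
tb_chr <- data.frame(word, freq, stringsAsFactors = FALSE)

# Convert the strings to integers before using them as sampling weights.
tb_chr$freq <- as.integer(tb_chr$freq)

set.seed(42)  # arbitrary seed, only so the draw is reproducible
draw <- sample(tb_chr$freq, size = 2, prob = tb_chr$freq)
draw
```

The result is still random: larger frequencies are more likely to be drawn, but they are not guaranteed to be.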
Let's see how strongly this prioritizes the words we expect to get. For demonstration, I'll sample the word based on its freq (since that makes more sense to me); you can move variables around as you see fit.
samps <- replicate(1000, sample(tb$word, size = 2, prob = tb$freq))
str(samps)
# chr [1:2, 1:1000] "water" "apple" "water" "banana" "watermelon" "banana" ...
sort(table(samps))
# samps
#        sky watermelon       blue      water      apple     banana 
#         93        151        166        370        572        648 
The replicate call gives us a matrix, so after sorting the frequencies we see that banana is more likely than all others.
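To make the shape of that matrix concrete, here is a small sketch, assuming the same data as in the Data section (the seed is arbitrary and only there for reproducibility). Each column holds one repetition and each row holds one of the two picks:

```r
word <- c("banana", "watermelon", "water", "apple", "blue", "sky")
freq <- as.integer(c("10", "2", "5", "8", "2", "1"))
tb <- data.frame(freq, word, stringsAsFactors = FALSE)

set.seed(1)  # arbitrary seed
samps <- replicate(1000, sample(tb$word, size = 2, prob = tb$freq))

dim(samps)
# [1]    2 1000

# sample() draws without replacement by default, so the two picks
# within one repetition are never the same word.
any(samps[1, ] == samps[2, ])
# [1] FALSE
```

Because table() flattens the matrix, the counts it reports add up to 2000 (two picks in each of the 1000 repetitions).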
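We can see that the proportions are about right by comparing the observed shares below with each word's share of the total frequency (the pct column computed after them). They will not match exactly, because sample() draws without replacement by default, which slightly under-represents the most frequent words: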
sort(table(samps)) / sum(table(samps))
# samps
#        sky watermelon       blue      water      apple     banana 
#     0.0465     0.0755     0.0830     0.1850     0.2860     0.3240 
tb$pct <- tb$freq / sum(tb$freq)
tb <- tb[ order(tb$pct), ]
tb
#   freq       word        pct
# 6    1        sky 0.03571429
# 2    2 watermelon 0.07142857
# 5    2       blue 0.07142857
# 3    5      water 0.17857143
# 4    8      apple 0.28571429
# 1   10     banana 0.35714286
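If the goal is literally the output requested in the question (10 and 8, the two highest frequencies), no random sampling is needed. A minimal sketch, assuming the same data as in the Data section and using ordering instead of sample() (top2 is a hypothetical name):

```r
word <- c("banana", "watermelon", "water", "apple", "blue", "sky")
freq <- as.integer(c("10", "2", "5", "8", "2", "1"))
tb <- data.frame(freq, word, stringsAsFactors = FALSE)

# Sort rows by descending frequency and keep the first two.
top2 <- head(tb[order(tb$freq, decreasing = TRUE), ], 2)

top2$freq
# [1] 10  8
top2$word
# [1] "banana" "apple"
```

This is deterministic, so it always returns the same rows; the weighted sample() approach only makes those rows more likely.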
Data
word <- c("banana", "watermelon", "water", "apple", "blue", "sky")
freq <- as.integer(c("10", "2", "5", "8", "2", "1"))
tb <- data.frame(freq, word)