For a class project I have a set of tweets categorized into 3 types of speech: hate, regular, and offensive. My goal is to eventually train a classifier to predict the correct type of tweet from the data.
I have a tibble of the data in a tidy format (one word per row) containing each word's TF-IDF score. I have censored the offensive language with asterisks:
> tfidf_words
# A tibble: 34,717 x 7
   speech tweet_id word       n    tf   idf tf_idf
   <fct>     <int> <chr>  <int> <dbl> <dbl>  <dbl>
 1 hate   24282747 reason     1 0.25   5.69  1.42
 2 hate   24282747 usd        1 0.25   8.73  2.18
 3 hate   24282747 bunch      1 0.25   5.60  1.40
 4 hate   24282747 ******     1 0.25   5.21  1.30
 5 hate   24284443 sand       1 0.5    4.76  2.38
 6 hate   24284443 ******     1 0.5    2.49  1.24
 7 hate   24324552 madden     1 0.111  8.73  0.970
 8 hate   24324552 call       1 0.111  4.11  0.456
 9 hate   24324552 ******     1 0.111  2.05  0.228
10 hate   24324552 set        1 0.111  5.90  0.655
# ... with 34,707 more rows
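(For context, a table like this is typically produced with tidytext's bind_tf_idf(). A minimal sketch, assuming a hypothetical tidy tibble word_counts with one row per (speech, tweet_id, word) and a count column n; the input name is my assumption, not from the original:)

library(dplyr)
library(tidytext)

# word_counts: one row per (speech, tweet_id, word) with its count n,
# e.g. built with count(speech, tweet_id, word)
tfidf_words <- word_counts %>%
  bind_tf_idf(word, tweet_id, n)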
To limit the size of my training feature space, I want to take the first n unique words of each speech type, ranked by TF-IDF score. My vocabulary is a vector of all the unique words chosen for my feature space, defined as vocabulary <- unique(feature_space$word). In my program I use SENTIMENT_SIZE to define how many words of each speech type I want in my model.
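(For concreteness, that constant is just a number; the value here is illustrative:)

SENTIMENT_SIZE <- 100   # unique words to keep per speech type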
I have tried both this:
feature_space <- tfidf_words %>%
  arrange(desc(tf_idf)) %>%       # rank all rows by descending tf-idf
  group_by(speech) %>%
  slice(1:SENTIMENT_SIZE) %>%     # keep the first n rows of each speech type
  ungroup() %>%
  arrange(tweet_id)
and this:
feature_space <- tfidf_words %>%
  group_by(speech) %>%
  top_n(n = SENTIMENT_SIZE, wt = tf_idf) %>%   # keeps ties, so groups can exceed n rows
  ungroup()
Both of these "sort of" work, but neither handles duplicates the way I would like. If, for example, I set SENTIMENT_SIZE to 100, I would want to see 100 unique words from each speech type, for a total of 300 words.
Instead, we have this result for method 1:
> length(vocabulary)
[1] 248
And this result for method 2:
> length(vocabulary)
[1] 293
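(The shortfall happens because the same word can be selected in more than one speech type, or more than once within a type, and unique() then collapses the repeats. A toy illustration with made-up data:)

library(dplyr)

toy <- tibble(
  speech = c("hate", "hate", "offensive", "regular"),
  word   = c("call", "set",  "call",      "set"),  # "call" and "set" repeat across types
  tf_idf = c(0.9,    0.8,    0.7,         0.6)
)

top2 <- toy %>%
  arrange(desc(tf_idf)) %>%
  group_by(speech) %>%
  slice(1:2) %>%
  ungroup()

nrow(top2)                 # 4 rows selected across the three types...
length(unique(top2$word))  # ...but only 2 unique words survive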
How can I make sure that exactly SENTIMENT_SIZE unique words are selected for each speech type?
Here I assume you are looking for the unique words within each group of speech:
tfidf_words %>%
  arrange(desc(tf_idf)) %>%
  group_by(speech) %>%
  distinct(word, .keep_all = TRUE)
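(Note that because distinct() runs after group_by(speech), the deduplication is per speech type: each word appears at most once within a type, but the same word can still be kept in two different types, so a subsequent slice() could still leave length(vocabulary) below 3 * SENTIMENT_SIZE.)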
Thanks to @A. Suliman, I think I have something that works now.
feature_space <- tfidf_words %>%
  arrange(desc(tf_idf)) %>%
  distinct(word, .keep_all = TRUE) %>%   # remove all duplicate words, keeping each word's highest-tf-idf row
  group_by(speech) %>%
  slice(1:SENTIMENT_SIZE) %>%            # grab the first n words of each speech category
  ungroup()
This should always produce the expected number of words in my vocabulary, because it preemptively removes the duplicates (and with them any chance of a tie).
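(A quick sanity check, assuming SENTIMENT_SIZE is 100 and each speech type still has at least that many distinct words after deduplication:)

vocabulary <- unique(feature_space$word)
length(vocabulary)   # expected: 3 * SENTIMENT_SIZE = 300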