I've got a dataset where every piece of text is labelled 'Positive', 'Neutral' or 'Negative'. Each piece of text is assigned to an id, and each id is linked to multiple rows of these labels. I want to create 2 new columns: the ratio positive / total, where total = positive + negative + neutral, and the ratio negative / total.
I worked out how to view the frequency of each label per id in a matrix, but I have no idea how to write a script that does calculations with those frequency numbers.
A sample of the dataframe I'm working with:
category_senti artist_id
Positive 01_artist
Negative 01_artist
Positive 02_artist
Negative 02_artist
Neutral 02_artist
Negative 03_artist
Positive 03_artist
Neutral 03_artist
Negative 03_artist
Neutral 03_artist
Negative 04_artist
Positive 04_artist
..... .....
..... 23_artist
What I have tried so far does show the frequency of the labels per artist_id, but I need to be able to do calculations with those frequencies in a custom-written function.
data[data$artist_id == "03_artist",] %>% group_by(category_senti) %>% summarise(n=n())
# A tibble: 3 x 2
category_senti n
<fct> <int>
1 Negative 59
2 Neutral 157
3 Positive 165
I'm hoping to create two new dataframes, data$pos_ratio and data$neg_ratio, holding for every artist_id the number of times the label 'Positive' (or 'Negative') appears divided by the total number of rows for that artist_id.
So ideally the pos_ratio dataframe would look like this:
artist_id pos_ratio
01_artist 0.4764
02_artist 0.3566
03_artist 0.8472
04_artist 0.3058
05_artist 0.2056
06_artist 0.2534
..... ......
Thanks in advance!
We can group by 'artist_id' and create a column 'n' with the frequency count, then additionally group by 'category_senti', take the ratio of each label's frequency to the 'n' column created earlier, and split by 'category_senti' into a list of data.frames.
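library(dplyr)
data %>%
  group_by(artist_id) %>%
  mutate(n = n()) %>%                          # total rows per artist_id
  group_by(category_senti, .add = TRUE) %>%    # dplyr >= 1.0.0; older versions use add = TRUE
  summarise(ratio = n() / n[1]) %>%            # label count divided by the artist total
  ungroup() %>%
  group_split(category_senti, .keep = FALSE)   # dplyr >= 1.0.0; older versions use keep = FALSE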
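As a cross-check on the grouped pipeline above, the same per-artist ratios can be sketched in base R with table() and prop.table(). This is only an illustration under stated assumptions: `sample_df` is a name made up here and holds the twelve sample rows from the question, not the full dataset.

```r
# Sample rows copied from the question; `sample_df` is a hypothetical name.
sample_df <- data.frame(
  category_senti = c("Positive", "Negative",
                     "Positive", "Negative", "Neutral",
                     "Negative", "Positive", "Neutral", "Negative", "Neutral",
                     "Negative", "Positive"),
  artist_id = rep(c("01_artist", "02_artist", "03_artist", "04_artist"),
                  times = c(2, 3, 5, 2))
)

# Count each label per artist (rows = artists, columns = labels),
# then divide every row by its row total.
counts <- table(sample_df$artist_id, sample_df$category_senti)
ratios <- prop.table(counts, margin = 1)

# One small data frame per label, in the shape the question asks for.
pos_ratio <- data.frame(artist_id = rownames(ratios),
                        pos_ratio = as.vector(ratios[, "Positive"]))
neg_ratio <- data.frame(artist_id = rownames(ratios),
                        neg_ratio = as.vector(ratios[, "Negative"]))
```

Because the full frequency matrix is kept in `ratios`, the neutral share is available from the same object without any extra grouping.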
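You can calculate the ratio of positive values with mean(category_senti == 'Positive'), and similarly for negatives. The comparison gives a logical vector, and the mean of a logical vector is the proportion of TRUE values.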
library(data.table)
setDT(df)
out <-
  df[, .(pos_ratio = mean(category_senti == 'Positive'),
         neg_ratio = mean(category_senti == 'Negative')),
     by = artist_id]
# artist_id pos_ratio neg_ratio
# 1: 01_artist 0.5000000 0.5000000
# 2: 02_artist 0.3333333 0.3333333
# 3: 03_artist 0.2000000 0.4000000
# 4: 04_artist 0.5000000 0.5000000
If you want to look at positive or negative as a separate dataset, you can just select the relevant columns of out:
out[, .(artist_id, neg_ratio)]
# artist_id neg_ratio
# 1: 01_artist 0.5000000
# 2: 02_artist 0.3333333
# 3: 03_artist 0.4000000
# 4: 04_artist 0.5000000
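The question also mentions adding the ratios as two new columns on the original rows. A minimal base R sketch of that variant, assuming the same twelve sample rows, uses ave(), which applies mean() within each group by default and returns one value per input row. The name `rows_df` is made up for this illustration.

```r
# Sample rows copied from the question; `rows_df` is a hypothetical name.
rows_df <- data.frame(
  category_senti = c("Positive", "Negative",
                     "Positive", "Negative", "Neutral",
                     "Negative", "Positive", "Neutral", "Negative", "Neutral",
                     "Negative", "Positive"),
  artist_id = rep(c("01_artist", "02_artist", "03_artist", "04_artist"),
                  times = c(2, 3, 5, 2))
)

# Turn each label test into 0/1, then take the per-artist mean.
# ave() repeats the group mean on every row of that artist.
rows_df$pos_ratio <- ave(as.numeric(rows_df$category_senti == "Positive"),
                         rows_df$artist_id)
rows_df$neg_ratio <- ave(as.numeric(rows_df$category_senti == "Negative"),
                         rows_df$artist_id)
```

Every row of a given artist_id carries the same pos_ratio and neg_ratio, so unique(rows_df[, c("artist_id", "pos_ratio", "neg_ratio")]) collapses the result back to one row per artist.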
Data used:
df <- fread('
category_senti artist_id
Positive 01_artist
Negative 01_artist
Positive 02_artist
Negative 02_artist
Neutral 02_artist
Negative 03_artist
Positive 03_artist
Neutral 03_artist
Negative 03_artist
Neutral 03_artist
Negative 04_artist
Positive 04_artist
')