Need help calculating ratio of non numeric value in rows in dataframe in R

I have a dataset in which every piece of text is labelled 'Positive', 'Neutral' or 'Negative'. Each piece of text is assigned to an id, so every id is linked to multiple rows of labelled values. I want to create two new columns: the ratio of positive / (positive + negative + neutral), i.e. positive / total, and the ratio of negative / total.

I worked out how to view the frequency of each label per id in a matrix, but I have no idea how to write a script that does calculations with the frequencies in those matrices.

A sample of the dataframe I'm working with:

category_senti        artist_id
Positive              01_artist
Negative              01_artist
Positive              02_artist
Negative              02_artist
Neutral               02_artist
Negative              03_artist
Positive              03_artist
Neutral               03_artist
Negative              03_artist
Neutral               03_artist
Negative              04_artist
Positive              04_artist
.....                 .....
.....                 23_artist

What I have tried so far works in that it shows the frequency of each label for a given artist_id, but I need to be able to do calculations with those frequencies in a custom function.

data[data$artist_id == "03_artist",] %>% group_by(category_senti) %>% summarise(n=n())

# A tibble: 3 x 2
  category_senti     n
  <fct>          <int>
1 Negative          59
2 Neutral          157
3 Positive         165

I'm hoping to create two new dataframes, data$pos_ratio and data$neg_ratio, holding for every artist_id the number of times the label 'Positive' (or 'Negative') appears divided by the total number of rows for that artist_id.

So ideally the pos_ratio dataframe would look like this:

artist_id   pos_ratio
01_artist   0.4764
02_artist   0.3566
03_artist   0.8472
04_artist   0.3058
05_artist   0.2056
06_artist   0.2534
.....       ......

Thanks in advance!

We can group by 'artist_id' and create a column 'n' holding the total row count per artist, then additionally group by 'category_senti', divide each label's frequency by that total, and finally split by 'category_senti' into a list of data.frames:

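library(dplyr)
# dplyr >= 1.0.0 spells these arguments .add and .keep (add/keep are deprecated)
data %>% 
    group_by(artist_id) %>%
    mutate(n = n()) %>%                          # total rows per artist
    group_by(category_senti, .add = TRUE) %>%    # keep the artist_id grouping
    summarise(ratio = n()/n[1]) %>%              # label count / artist total
    ungroup() %>%
    group_split(category_senti, .keep = FALSE)   # one data.frame per label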

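You can calculate the ratio of positive values with mean(category_senti == 'Positive'), and similarly for negatives. This works because the comparison yields a logical vector, and the mean of a logical vector is the proportion of TRUE values.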

library(data.table)
setDT(df)

out <- 
  df[, .(pos_ratio = mean(category_senti == 'Positive'),
          neg_ratio = mean(category_senti == 'Negative'))
     , by = artist_id]

#    artist_id pos_ratio neg_ratio
# 1: 01_artist 0.5000000 0.5000000
# 2: 02_artist 0.3333333 0.3333333
# 3: 03_artist 0.2000000 0.4000000
# 4: 04_artist 0.5000000 0.5000000

If you want to look at positive or negative as a separate dataset, you can just subset 'out':

out[, .(artist_id, neg_ratio)]
#    artist_id neg_ratio
# 1: 01_artist 0.5000000
# 2: 02_artist 0.3333333
# 3: 03_artist 0.4000000
# 4: 04_artist 0.5000000

Data used:

df <- fread('
category_senti        artist_id
Positive              01_artist
Negative              01_artist
Positive              02_artist
Negative              02_artist
Neutral               02_artist
Negative              03_artist
Positive              03_artist
Neutral               03_artist
Negative              03_artist
Neutral               03_artist
Negative              04_artist
Positive              04_artist
')
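
Since the question starts from a frequency matrix, it may also help to see the ratios computed straight from that matrix in base R, without extra packages. This is a minimal sketch rather than a definitive solution: it assumes the labels are spelled exactly 'Positive', 'Neutral' and 'Negative', it rebuilds the sample rows from the question as a stand-in for the full dataset, and the names 'freq', 'ratio' and 'ratios' are illustrative and do not come from the posts above.

```r
# Sample rows copied from the question (a stand-in for the full dataset)
data <- data.frame(
  category_senti = c("Positive", "Negative",
                     "Positive", "Negative", "Neutral",
                     "Negative", "Positive", "Neutral", "Negative", "Neutral",
                     "Negative", "Positive"),
  artist_id = rep(c("01_artist", "02_artist", "03_artist", "04_artist"),
                  times = c(2, 3, 5, 2))
)

# Frequency matrix: one row per artist_id, one column per label
freq <- table(data$artist_id, data$category_senti)

# Divide every row by its row total (margin = 1 means "by row")
ratio <- prop.table(freq, margin = 1)

# Collect the two columns of interest in a plain data frame
# ('ratios' is a hypothetical name chosen for this sketch)
ratios <- data.frame(
  artist_id = rownames(ratio),
  pos_ratio = as.vector(ratio[, "Positive"]),
  neg_ratio = as.vector(ratio[, "Negative"])
)
print(ratios)
```

On the sample rows this should give the same pos_ratio and neg_ratio values as the data.table output above, for example 0.2 and 0.4 for 03_artist. A single artist's ratios can then be read off with ratios[ratios$artist_id == "03_artist", ].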

