R: Histogram and density on a multiple-user response data frame

Description of Data

The data reflect how users rated books on an online book recommendation site while answering a question that has four possible answers (X1 to X4). Users were allowed to choose more than one answer.

The goal is to obtain distribution plots by gender, with the answer (X1, X2, ...) on the x axis and the count of books on the y axis, along with a density overlay. Ideally the male and female distributions would be overlaid on one another.

book_id  user_id  rate  X1   X2   X3    X4  Gender  genre
40         001     4.5    0    1    0    0  male    fiction
48         001     3.5    1    0    0    1  male    fiction
54         001     4.0    1    0    0    0  male    fiction
79         001     2.5    1    0    1    0  male    non-fiction
80         001     4.5    0    0    1    0  male    non-fiction
95         001     5.0    1    0    1    0  male    non-fiction
95         002     3.0    0    0    0    1  Female  non-fiction
99         002     4.5    0    0    1    0  Female  non-fiction
02         002     0.5    0    0    0    0  Female  non-fiction
05         002     4.5    1    0    1    0  Female  non-fiction
54         002     4.0    0    1    0    0  Female  fiction
79         002     2.5    1    0    1    0  Female  non-fiction
80         002     4.5    0    0    1    0  Female  non-fiction
07         002     4.5    1    0    1    0  Female  fiction
07         003     5.0    1    0    1    0  Female  fiction
09         003     4.0    0    0    1    0  Female  auto-bio
54         003     4.0    1    0    0    0  Female  fiction
79         003     2.5    1    0    1    0  Female  non-fiction
80         003     4.5    0    0    1    0  Female  non-fiction
17         004     3.5    1    0    0    0  male    auto-bio
21         004     5.0    1    0    1    0  male    auto-bio
21         005     5.0    0    1    1    0  male    auto-bio
17         005     0.5    0    0    0    1  male    auto-bio
20         005     5.0    0    0    1    0  male    fiction
20         006     1.5    0    0    0    1  male    fiction
21         006     5.0    0    0    1    0  male    auto-bio
21         007     2.0    1    0    0    0  male    auto-bio
21         008     4.5    1    0    1    0  Female  auto-bio
20         008     4.5    1    0    1    0  Female  fiction
07         008     4.5    1    0    1    0  Female  fiction
22         009     5.0    0    0    1    0  male    fiction
54         009     4.0    1    0    0    0  male    fiction
79         009     2.5    1    0    1    0  male    non-fiction
80         010     4.5    1    0    1    0  male    non-fiction
22         010     4.5    0    1    1    0  male    fiction
22         011     0.5    0    0    1    0  Female  fiction
28         011     3.5    1    0    0    0  Female  auto-bio
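
The code further down assumes the table is already loaded as a data frame named df. A minimal sketch of one way to load it, using only a handful of rows from the table above. Reading the ID columns as character is an assumption made here so that leading zeros such as 001 and 02 survive; the original post does not say how the data were read.

```r
# Sketch: read a few rows of the table above into `df`.
# Assumption: IDs are kept as character so "001" and "02" keep their leading zeros.
txt <- "book_id user_id rate X1 X2 X3 X4 Gender genre
40 001 4.5 0 1 0 0 male fiction
48 001 3.5 1 0 0 1 male fiction
95 002 3.0 0 0 0 1 Female non-fiction
02 002 0.5 0 0 0 0 Female non-fiction"

df <- read.table(
  text = txt, header = TRUE,
  colClasses = c("character", "character", "numeric",
                 rep("integer", 4), "character", "character")
)

# The summary output in the question shows Gender as a factor (<fct>).
df$Gender <- factor(df$Gender)
str(df)
```

To reproduce the full results, paste all rows of the table into txt in the same way.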

Two users can rate the same book and answer the question in the same way or differently, so a book can appear in more than one record (one per user who rated it). With that in mind, grouping by Gender and summing each answer column gives a gender-level distribution to start with.

df %>% group_by(Gender) %>% summarize(x1 = sum(X1), x2 = sum(X2), x3=sum(X3),x4 =sum(X4))

  Gender    x1    x2    x3    x4
  <fct>  <int> <int> <int> <int>
1 Female    10     1    13     1
2 male      10     3    11     3

In addition to the plot, I have a follow-up question. Just to confirm: x1 for Female is not the number of unique books for which females chose X1, since the same book can be answered by multiple users. Instead, is it the number of times females chose that specific answer?
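
To illustrate the distinction being asked about, here is a small sketch in base R using five Female rows copied from the table above (only the columns needed). The variable names females, n_selections and n_books are made up for this example. Summing X1 counts one selection per user-book record, while counting unique book_id values counts each book once.

```r
# A few Female rows copied from the data above (only the columns needed here).
females <- data.frame(
  book_id = c("07", "07", "07", "54", "09"),
  user_id = c("002", "003", "008", "003", "003"),
  X1      = c(1L, 1L, 1L, 1L, 0L)
)

# Number of times answer X1 was selected (one count per user-book record).
n_selections <- sum(females$X1)

# Number of distinct books for which X1 was selected at least once.
n_books <- length(unique(females$book_id[females$X1 == 1]))

n_selections  # 4
n_books       # 2
```

Book 07 was rated by three different users who all chose X1, so it contributes 3 to the sum but only 1 to the count of distinct books.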

A similar but different approach:

library(data.table)
library(ggplot2)
# Assumes the data frame is named df, as in the question's own code
dt <- as.data.table(df)

plottest <- melt(dt,measure.vars = patterns("^X"),variable.name = "question", value.name = "answer")

ggplot(data = plottest,aes(factor(book_id),answer))+
  geom_col(aes(fill = as.factor(question), color = as.factor(question) ))+
  facet_wrap(~Gender)+
  labs(title =  "",
       y = "N",
       x = "books",
       color = "Question",
       fill = "Question")

I am not sure I understand correctly, but is the following code what you want?

library(dplyr)
library(ggplot2)
library(reshape2)  # provides melt()

df2 <- df %>% 
  group_by(Gender) %>% 
  summarize(x1 = sum(X1), x2 = sum(X2), x3=sum(X3),x4 =sum(X4)) %>%
  melt(id.vars = "Gender")


ggplot(df2, aes(variable, value, color = Gender, fill = Gender)) +
  geom_bar(stat = "identity", position = "dodge")
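
The question also asks for a density overlay with both genders on top of one another. With a categorical x axis, one possible reading (an assumption, not something the original post confirms) is the share of each answer within each gender rather than a kernel density. A rough sketch using the counts from the summary table in the question; the names counts, props and long are made up for this example, and the plot is only drawn if ggplot2 is installed.

```r
# Counts taken from the group_by/summarize output in the question.
counts <- rbind(
  Female = c(x1 = 10, x2 = 1, x3 = 13, x4 = 1),
  male   = c(x1 = 10, x2 = 3, x3 = 11, x4 = 3)
)

# Share of each answer within each gender (every row sums to 1).
props <- prop.table(counts, margin = 1)

# Long format for plotting: one row per gender-answer pair.
long <- as.data.frame(as.table(props), responseName = "share")
names(long)[1:2] <- c("Gender", "answer")

# Overlay the two genders with semi-transparent bars.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long, aes(answer, share, fill = Gender)) +
    geom_col(position = "identity", alpha = 0.5)
  print(p)
}
```

Using position = "identity" with alpha below 1 draws the bars for both genders in the same place, so the two distributions can be compared directly even though the genders have different numbers of responses.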

After seeing the answer by @denis, I adapted his code to do more or less the same, but with position = "dodge".

df3 <- df %>% 
  group_by(Gender, book_id) %>% 
  summarize(x1 = sum(X1), x2 = sum(X2), x3=sum(X3),x4 =sum(X4)) %>%
  melt(id.vars = c("Gender", "book_id"))

ggplot(df3, aes(as.factor(book_id), value, color = variable, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ Gender)

As for the second question, you can use aggregate to get the total for each answer by Gender.

aggregate(. ~ Gender, df[4:8], sum)
#  Gender X1 X2 X3 X4
#1 Female 10  1 13  1
#2   male 10  3 11  3
