
Data Visualization Clarification in R for a density / histogram plot

I'm working with the Kickstarter dataset from Kaggle and would like to create a meaningful ggplot visualization of the projects' pledge ratios. The pledge ratio is a field I added: for each project, it is the USD pledged amount divided by the USD goal amount.

To replicate the dataset I'm using in R, please use the following code:

if(!require(tidyverse)) install.packages("tidyverse", repos = "http://cran.us.r-project.org")
if(!require(ggplot2)) install.packages("ggplot2", repos = "http://cran.us.r-project.org")
if(!require(dplyr)) install.packages("dplyr", repos = "http://cran.us.r-project.org")

library(tidyverse)
library(ggplot2)
library(dplyr)

file_path <- "https://raw.githubusercontent.com/mganopolsky/kickstarter/master/data/ks-projects-201801.csv"
data <- read_csv(file_path)


ds <- data %>% dplyr::select(-`usd pledged`)

ds <- ds %>% mutate(time_int = as.numeric(deadline - as.Date(launched)) ,
                    launched = as.Date(launched),
                    pledged_ratio = round(usd_pledged_real / usd_goal_real, 2),
                    avg_backer_pldg = ifelse(backers == 0, 0, round(usd_pledged_real/backers) )) %>%
  mutate(launched_month = as.factor(format(launched, "%m")),
         launched_day_of_week = as.factor(format(launched, "%u")  ),
         currency = as.factor(currency),
         launched_year = as.factor(format(launched, "%Y")))


ds <- ds %>% filter(launched >= as.Date("2009-04-21"))

At this point, I would like to visualize what kinds of pledged_ratio values we see across projects. This data can be viewed with the following code:

ds %>% filter(state=="successful" ) %>% group_by(pledged_ratio) %>% summarise( pledged_ratio_count = n()) %>%
  arrange(desc(pledged_ratio)) 

This gives an idea of how many projects fall at each specific ratio; however, the raw per-ratio counts aren't really meaningful. A binned display of some sort would be preferable, for instance using geom_histogram() or even geom_density().
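As a minimal sketch of what a binned summary could look like, the snippet below groups ratios into hand-picked, roughly log-spaced bins with base R's cut(). The ratios vector and the break points are made up for illustration; they are not taken from the Kickstarter data.

```r
# Hypothetical pledge ratios for successful projects (illustration only).
ratios <- c(1.00, 1.02, 1.05, 1.10, 1.30, 1.80, 2.50, 4.00, 12.00, 150.00)

# Assumed bin edges, roughly log-spaced so the long right tail gets its own bins.
breaks <- c(1, 1.1, 1.5, 2, 5, 10, 100, Inf)

# right = FALSE makes each bin closed on the left: [1, 1.1), [1.1, 1.5), ...
bins <- cut(ratios, breaks = breaks, right = FALSE)

# Count of projects per bin, and the same counts as shares of all projects.
bin_counts <- table(bins)
bin_shares <- prop.table(bin_counts)

print(bin_counts)
print(round(bin_shares, 2))
```

With bins like these, the crowd just above a ratio of 1 and the sparse tail of very high ratios each get their own rows instead of being squeezed into one bar.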

When I run the density plot, the result looks like this:

ds %>% filter(state=="successful" ) %>% 
  arrange(desc(pledged_ratio))  %>% ggplot(aes(pledged_ratio)) + geom_density() + 
  ggtitle("Density Distribution of Pledge Ratios for Successful Projects") + xlab("Pledge Ratios")


This makes sense once you stare at it for a while, because most of the projects get funded at around 100%, or a ratio of 1. However, some get funded at much higher ratios, and I want a visualization that shows this in a meaningful way.

I have tried this with histograms:

ds %>% filter(state=="successful" ) %>% 
  arrange(desc(pledged_ratio))  %>% ggplot(aes(pledged_ratio)) + geom_histogram(bins = 20)

and this produced another somewhat meaningless histogram:


Finally, using geom_point() I got this:

ds %>% filter(state=="successful" ) %>% group_by(pledged_ratio) %>% summarise( pledged_ratio_count = n()) %>%
  arrange(desc(pledged_ratio))  %>% ggplot(aes(pledged_ratio, y=pledged_ratio_count)) + geom_point()

That resulted in this, which may be the most insightful graph so far:


However, I am still convinced that there's got to be a better way to convey what the data is telling us. Any advice would be greatly appreciated.

What about an empirical CDF?

library(scales)
ds %>% filter(state=="successful") %>% 
  ggplot(aes(x=pledged_ratio)) + 
  stat_ecdf() + 
  scale_x_continuous(trans="pseudo_log", breaks = c(10, 100, 1000, 10000, 100000), labels=comma) + 
  scale_y_continuous(labels=percent) + 
  theme_bw() + 
  labs(x="Pledged Ratio", y="Percentage of Projects")

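To make the reading of such a plot concrete, here is a minimal base R sketch of what an empirical CDF computes: for each value x, the share of observations less than or equal to x. The ratios vector is made up for illustration and is not the Kickstarter data. The last lines illustrate the general idea behind a pseudo-log axis using asinh(); I am assuming the scale used above is asinh-based, and its exact scaling may differ.

```r
# Hypothetical pledge ratios (illustration only, not the Kickstarter data).
ratios <- c(1.00, 1.02, 1.05, 1.10, 1.30, 1.80, 2.50, 4.00, 12.00, 150.00)

# ecdf() returns a step function; ecdf_fn(x) is the share of values <= x.
ecdf_fn <- ecdf(ratios)

share_up_to_150pct <- ecdf_fn(1.5)     # 0.5: half the projects are at or below 1.5
share_up_to_10x    <- ecdf_fn(10)      # 0.8: 80% are at or below 10x the goal
share_above_10x    <- 1 - ecdf_fn(10)  # remaining share in the far right tail

print(c(share_up_to_150pct, share_up_to_10x, share_above_10x))

# Idea behind a pseudo-log axis: asinh() is close to linear near zero and
# grows like log(2 * x) for large x, so very large ratios are compressed
# while zero remains a valid position on the axis.
x <- c(0, 1, 10, 100, 1000)
print(round(asinh(x), 2))
```

Reading the curve this way answers questions such as "what percentage of successful projects raised no more than twice their goal", which a density or histogram dominated by the spike near 1 does not show clearly.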

