Making a Sankey Diagram in R

I'm trying to create a Sankey diagram. I am using R with either the {plotly} or {networkD3} package. Both ask for the same type of data: source, target, value. I'm not really sure what source, target, and value are supposed to be, or how to aggregate my data into this format. I have the following:

data.frame(
  UniqID = rep(c(1:10), times=4), 
  Year = c(rep("2005", times=10), rep("2010", times=10), rep("2015", times=10), rep("2020", times=10)),
  Response_Variable = round(runif(n = 40, min = 0, max = 2), digits = 0)
)

The response variable is a categorical variable with values 0, 1, or 2. I would like to show the flow of the classes of this variable from one year to the next. The final product should look something like this:

[Image: example Sankey diagram with "Wave" stages and "Outcome" categories]

In my case, "Wave" would be Year and "Outcome" would be the classes (0, 1, 2) of the response variable.

You don't really have enough information in your data to make a chart exactly like that, because with the data you provided it's not clear which units moved from one category to another across years. Maybe you were trying to achieve that with the UniqID column, but the way the data is structured, it doesn't make sense: UniqID 1, for example, has two rows for 2005, two for 2015, and none for 2010 or 2020.

df <- data.frame(UniqID=rep(c(1:10), times=4), 
           Year=rep(c("2005", "2010", "2015", "2020"), times=10), 
           Response_Variable=round(runif(n=40, min = 0, max = 2), digits=0))

library(dplyr)

df %>% arrange(UniqID, Year) %>% filter(UniqID == 1)
#>   UniqID Year Response_Variable
#> 1      1 2005                 2
#> 2      1 2005                 1
#> 3      1 2015                 1
#> 4      1 2015                 0
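
The mismatch can be checked without any packages. The following is a minimal sketch in base R that cross-tabulates UniqID against Year for two ways of building the Year column: cycling through the years row by row (as in the data frame just above) and repeating each year in a block of ten rows (as in the question's data frame). The variable names `year_cycled`, `year_blocked`, `tab_cycled`, and `tab_blocked` are made up for this illustration.

```r
uniq_id <- rep(1:10, times = 4)

# Year cycles 2005, 2010, 2015, 2020, 2005, ... row by row
year_cycled <- rep(c("2005", "2010", "2015", "2020"), times = 10)

# Year comes in blocks: ten rows of 2005, then ten rows of 2010, ...
year_blocked <- rep(c("2005", "2010", "2015", "2020"), each = 10)

# Count how many rows each UniqID has in each Year
tab_cycled  <- table(UniqID = uniq_id, Year = year_cycled)
tab_blocked <- table(UniqID = uniq_id, Year = year_blocked)

# TRUE: some ID/year combinations are duplicated or missing
any(tab_cycled != 1)

# TRUE: every ID has exactly one row per year
all(tab_blocked == 1)
```

Only the blocked layout gives each ID exactly one observation per year, which is what tracking an individual across years would need.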

Ignoring that, the data format you're asking about is a list of "links", each one defining a movement from one node (the "source") to another node (the "target"). In your case, each year-category combination is a node, and you need a list of the links between those nodes, plus a "value" for each link; here, the number of occurrences of the source node makes the most sense. You could reshape your data to that format like this:

df %>% 
  group_by(Year, Response_Variable) %>% 
  summarise(value = n(), .groups = "drop") %>% 
  mutate(source = paste(Year, Response_Variable, sep = "_")) %>% 
  group_by(Response_Variable) %>% 
  mutate(target = lead(source, order_by = Year)) %>% 
  filter(!is.na(target))
#> # A tibble: 9 × 5
#> # Groups:   Response_Variable [3]
#>   Year  Response_Variable value source target
#>   <chr>             <dbl> <int> <chr>  <chr> 
#> 1 2005                  0     4 2005_0 2010_0
#> 2 2005                  1     3 2005_1 2010_1
#> 3 2005                  2     3 2005_2 2010_2
#> 4 2010                  0     2 2010_0 2015_0
#> 5 2010                  1     6 2010_1 2015_1
#> 6 2010                  2     2 2010_2 2015_2
#> 7 2015                  0     3 2015_0 2020_0
#> 8 2015                  1     3 2015_1 2020_1
#> 9 2015                  2     4 2015_2 2020_2
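
Note that the links above only connect each category to the same category in the next year. If every UniqID had exactly one row per year, the actual transitions between categories could be counted instead. The following is a sketch under that assumption; it builds Year in blocks with `each = 10`, and it swaps in `set.seed()` and `sample()` for the random values so the example is reproducible, neither of which comes from the original post.

```r
library(dplyr)

set.seed(42)  # assumption: fixed seed, only so the example is reproducible

# Assumed layout: each UniqID appears exactly once in each Year
df_tracked <- data.frame(
  UniqID = rep(1:10, times = 4),
  Year = rep(c("2005", "2010", "2015", "2020"), each = 10),
  Response_Variable = sample(0:2, size = 40, replace = TRUE)
)

transition_links <- df_tracked %>%
  arrange(UniqID, Year) %>%
  group_by(UniqID) %>%
  mutate(
    # node name for this observation, e.g. "2005_1"
    source = paste(Year, Response_Variable, sep = "_"),
    # node the same UniqID occupies in the following year
    target = lead(source)
  ) %>%
  ungroup() %>%
  # the last year of each UniqID has no following node
  filter(!is.na(target)) %>%
  # value = number of IDs making each source -> target move
  count(source, target, name = "value")

transition_links
```

Here each link's value is the number of IDs that moved from the source node to the target node, so the values sum to 30 (10 IDs times 3 year-to-year steps).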

To get to the more specific format that {networkD3} requires, you need one data.frame for links and one that lists each node. The links data.frame needs to refer to each node in the nodes data.frame by its 0-based index. You can set that up like this:

library(dplyr)
library(networkD3)

df <- 
  data.frame(
    UniqID=rep(c(1:10), times=4), 
    Year=rep(c("2005", "2010", "2015", "2020"), times=10), 
    Response_Variable=round(runif(n=40, min = 0, max = 2), digits=0)
  )

links <-
  df %>% 
  group_by(Year, Response_Variable) %>% 
  summarise(value = n(), .groups = "drop") %>% 
  mutate(source = paste(Year, Response_Variable, sep = "_")) %>% 
  group_by(Response_Variable) %>% 
  mutate(target = lead(source, order_by = Year)) %>% 
  filter(!is.na(target)) %>% 
  ungroup() %>% 
  select(source, target, value)

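# one row per distinct node name, in order of first appearance
nodes <- data.frame(node_id = unique(c(links$source, links$target)))

# replace each node name with its 0-based row position in `nodes`
links$source <- match(links$source, nodes$node_id) - 1
links$target <- match(links$target, nodes$node_id) - 1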

sankeyNetwork(
  Links = links,
  Nodes = nodes,
  Source = "source", 
  Target = "target", 
  Value = "value", 
  NodeID = "node_id"
)
#> Links is a tbl_df. Converting to a plain data frame.
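
Since the question also mentions {plotly}, here is a rough sketch of how the same nodes/links structure could be passed to a plotly Sankey trace, which also expects 0-based node indices. The small `example_nodes` and `example_links` data frames are hand-made placeholder values for illustration only, not results from the data above.

```r
library(plotly)

# Hypothetical nodes: one label per year-category combination
example_nodes <- data.frame(
  node_id = c("2005_0", "2005_1", "2010_0", "2010_1")
)

# Hypothetical links: source/target are 0-based positions in example_nodes
example_links <- data.frame(
  source = c(0, 0, 1, 1),
  target = c(2, 3, 2, 3),
  value  = c(3, 2, 1, 4)
)

fig <- plot_ly(
  type = "sankey",
  node = list(label = example_nodes$node_id),
  link = list(
    source = example_links$source,
    target = example_links$target,
    value  = example_links$value
  )
)

fig
```

The `nodes` and `links` data frames built for `sankeyNetwork()` above should fit the same slots, since both packages refer to nodes by 0-based index.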

The answer is to use ggsankey rather than plotly or networkD3!
