Making a Sankey diagram in R

Question

I'm trying to create a Sankey diagram in R with either the {plotly} or {networkD3} package. Both ask for the same type of data: source, target, value. I'm not sure what source, target, and value are supposed to be, or how to aggregate my data into this format. I have the following:

data.frame(
  UniqID = rep(c(1:10), times=4), 
  Year = c(rep("2005", times=10), rep("2010", times=10), rep("2015", times=10), rep("2020", times=10)),
  Response_Variable = round(runif(n = 40, min = 0, max = 2), digits = 0)
)

The response variable is a categorical variable taking the values 0, 1, or 2. I would like to show how the classes of this variable flow from one year to the next. The final product should look something like this:

In my case, "Wave" would be Year and "Outcome" would be the classes (0, 1, 2) of the response variable.

Answer 1

You don't really have enough information in your data to make a chart exactly like that, because with the data you provided it's not clear which things changed from one category to the next across years. Maybe you were trying to achieve that with the UniqID column, but the way the data is, it doesn't make sense...

df <- data.frame(UniqID=rep(c(1:10), times=4), 
           Year=rep(c("2005", "2010", "2015", "2020"), times=10), 
           Response_Variable=round(runif(n=40, min = 0, max = 2), digits=0))

library(dplyr)

df %>% arrange(UniqID, Year) %>% filter(UniqID == 1)
#>   UniqID Year Response_Variable
#> 1      1 2005                 2
#> 2      1 2005                 1
#> 3      1 2015                 1
#> 4      1 2015                 0

Ignoring that, the data format you're asking about is a list of "links", each one defining a movement from one node (the "source") to another node (the "target"). So in your case, each year-category combination is a node, and you need a list of the links between those nodes, and potentially a "value" for each link; in your case, the number of occurrences of the source node makes the most sense. You could reshape your data to that format like this...

df %>% 
  group_by(Year, Response_Variable) %>% 
  summarise(value = n(), .groups = "drop") %>% 
  mutate(source = paste(Year, Response_Variable, sep = "_")) %>% 
  group_by(Response_Variable) %>% 
  mutate(target = lead(source, order_by = Year)) %>% 
  filter(!is.na(target))
#> # A tibble: 9 × 5
#> # Groups:   Response_Variable [3]
#>   Year  Response_Variable value source target
#>   <chr>             <dbl> <int> <chr>  <chr> 
#> 1 2005                  0     4 2005_0 2010_0
#> 2 2005                  1     3 2005_1 2010_1
#> 3 2005                  2     3 2005_2 2010_2
#> 4 2010                  0     2 2010_0 2015_0
#> 5 2010                  1     6 2010_1 2015_1
#> 6 2010                  2     2 2010_2 2015_2
#> 7 2015                  0     3 2015_0 2020_0
#> 8 2015                  1     3 2015_1 2020_1
#> 9 2015                  2     4 2015_2 2020_2
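
If, unlike in the example above, each UniqID is observed exactly once per year, the links can instead count how many individuals moved from one year-category node to the next. The following is only a sketch under that assumption, written in base R; `df_panel`, `transitions`, and `links_panel` are hypothetical names, and the data is randomly generated for illustration.

```r
# Hypothetical panel data: every UniqID appears once in each year (assumption).
set.seed(1)
years <- c("2005", "2010", "2015", "2020")
df_panel <- data.frame(
  UniqID = rep(1:10, times = length(years)),
  Year = rep(years, each = 10),
  Response_Variable = sample(0:2, size = 40, replace = TRUE)
)

# Sort so that each individual's rows are in chronological order.
df_panel <- df_panel[order(df_panel$UniqID, df_panel$Year), ]

# Pair every row with the following row, keeping only pairs
# that belong to the same individual.
n <- nrow(df_panel)
cur <- df_panel[-n, ]
nxt <- df_panel[-1, ]
same_id <- cur$UniqID == nxt$UniqID

transitions <- data.frame(
  source = paste(cur$Year, cur$Response_Variable, sep = "_")[same_id],
  target = paste(nxt$Year, nxt$Response_Variable, sep = "_")[same_id]
)

# Count how many individuals made each source -> target move.
links_panel <- aggregate(
  value ~ source + target,
  data = cbind(transitions, value = 1),
  FUN = sum
)
```

With 10 individuals and 4 years there are 30 year-to-year moves in total, so the `value` column sums to 30. Because a source node can now link to several target nodes, the flows can cross between categories, which is what the chart in the question shows.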

To get to the more specific format that {networkD3} requires, you need one data.frame for links and one that lists each node. The links data.frame needs to refer to each node in the nodes data.frame by its 0-based index. You can set that up like this...

library(dplyr)
library(networkD3)

df <- 
  data.frame(
    UniqID=rep(c(1:10), times=4), 
    Year=rep(c("2005", "2010", "2015", "2020"), times=10), 
    Response_Variable=round(runif(n=40, min = 0, max = 2), digits=0)
  )

links <-
  df %>% 
  group_by(Year, Response_Variable) %>% 
  summarise(value = n(), .groups = "drop") %>% 
  mutate(source = paste(Year, Response_Variable, sep = "_")) %>% 
  group_by(Response_Variable) %>% 
  mutate(target = lead(source, order_by = Year)) %>% 
  filter(!is.na(target)) %>% 
  ungroup() %>% 
  select(source, target, value)

nodes <- data.frame(node_id = unique(c(links$source, links$target)))  

links$source <- match(links$source, nodes$node_id) - 1
links$target <- match(links$target, nodes$node_id) - 1

sankeyNetwork(
  Links = links,
  Nodes = nodes,
  Source = "source", 
  Target = "target", 
  Value = "value", 
  NodeID = "node_id"
)
#> Links is a tbl_df. Converting to a plain data frame.
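
Since the question also mentions {plotly}, here is a rough sketch of the same 0-based indexing passed to a plotly Sankey trace. The link values are made up for illustration, and `source_idx`, `target_idx`, and `fig` are hypothetical names. The plotly call is guarded so the indexing part runs even if the package is not installed.

```r
# Toy links in source/target/value form (values are invented).
links <- data.frame(
  source = c("2005_0", "2005_0", "2005_1", "2010_0", "2010_1"),
  target = c("2010_0", "2010_1", "2010_1", "2015_1", "2015_1"),
  value  = c(3, 1, 2, 3, 3)
)

# One row per distinct node, in order of first appearance.
nodes <- data.frame(node_id = unique(c(links$source, links$target)))

# match() returns 1-based positions; subtract 1 for 0-based indices.
source_idx <- match(links$source, nodes$node_id) - 1
target_idx <- match(links$target, nodes$node_id) - 1

if (requireNamespace("plotly", quietly = TRUE)) {
  fig <- plotly::plot_ly(
    type = "sankey",
    orientation = "h",
    node = list(label = nodes$node_id),
    link = list(
      source = source_idx,
      target = target_idx,
      value = links$value
    )
  )
  # Print `fig` in an interactive session to render the diagram.
}
```

Here the node labels come from `nodes$node_id`, and the link indices point into that same vector, which mirrors what `sankeyNetwork()` does with its `Nodes` and `Links` arguments.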

Answer 2

The answer is to use ggsankey rather than plotly or networkD3!
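
The answer gives no code, so the following is only a sketch of what a ggsankey approach might look like, assuming each UniqID is observed once per year and that the ggsankey and ggplot2 packages are installed. The data is randomly generated, and `df_panel`, `wide`, `long`, and `p` are hypothetical names. The data is first reshaped to one column per year with base R, then `make_long()` converts it to the layout that `geom_sankey()` expects.

```r
# Hypothetical panel data: every UniqID appears once in each year (assumption).
set.seed(1)
years <- c("2005", "2010", "2015", "2020")
df_panel <- data.frame(
  UniqID = rep(1:10, times = length(years)),
  Year = rep(years, each = 10),
  Response_Variable = sample(0:2, size = 40, replace = TRUE)
)

# One row per individual, one column per year.
wide <- reshape(
  df_panel,
  idvar = "UniqID",
  timevar = "Year",
  v.names = "Response_Variable",
  direction = "wide",
  sep = "_"
)
names(wide) <- sub("^Response_Variable_", "y", names(wide))

# Plotting is guarded so the reshaping runs without the packages.
if (requireNamespace("ggsankey", quietly = TRUE) &&
    requireNamespace("ggplot2", quietly = TRUE)) {
  # make_long() produces the x / next_x / node / next_node columns.
  long <- ggsankey::make_long(wide, y2005, y2010, y2015, y2020)

  p <- ggplot2::ggplot(
    long,
    ggplot2::aes(
      x = x, next_x = next_x,
      node = node, next_node = next_node,
      fill = factor(node)
    )
  ) +
    ggsankey::geom_sankey() +
    ggsankey::theme_sankey()
  # Print `p` in an interactive session to draw the diagram.
}
```

With this layout, the years play the role of "Wave" on the x-axis and the 0/1/2 classes play the role of "Outcome".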
