How can I create a new wide data frame with rows based on all combos of values in two columns?
I have the following dataframe (dput is provided at the bottom of the question):
>df_input
# A tibble: 5 x 4
category range samples events
<chr> <chr> <dbl> <dbl>
1 GroupA Apr2002 4951 97796
2 GroupA May2002 9332 195726
3 GroupB Apr2001 4781 80767
4 GroupB Oct2001 5677 92890
5 GroupB OctToNov2001 10296 166037
I would like to create a new dataframe whose rows cover every pairing of a GroupA row with a GroupB row, as identified by the category and range columns. For example, the row with category = GroupA and range = Apr2002 would appear in 3 rows of the output dataframe, one for each of the three category = GroupB rows.
The range column in the input dataframe will always contain unique values only.
I would also like to rename the combined output columns for events, samples and range to include the Group names (i.e. range_GroupA, range_GroupB, samples_GroupA, events_GroupA, samples_GroupB, events_GroupB).
I'm struggling with how to create my combined rows from the category column. I'm also struggling to find the right search terms here to find similar questions/answers. The closest I've managed to find so far is "Create new rows in data frame based on multiple values of column", but the combo in that question is a bit different from what I'm attempting.
The desired output dataframe is:
> df_output
# A tibble: 6 x 6
range_GroupA range_GroupB samples_GroupA events_GroupA samples_GroupB events_GroupB
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Apr2002 Apr2001 4951 97796 4781 80767
2 Apr2002 Oct2001 4951 97796 5677 92890
3 Apr2002 OctToNov2001 4951 97796 10296 166037
4 May2002 Apr2001 9332 195726 4781 80767
5 May2002 Oct2001 9332 195726 5677 92890
6 May2002 OctToNov2001 9332 195726 10296 166037
df_input dataframe:
df_input <- structure(list(category = c("GroupA", "GroupA", "GroupB", "GroupB",
"GroupB"), range = c("Apr2002", "May2002", "Apr2001", "Oct2001",
"OctToNov2001"), samples = c(4951, 9332, 4781, 5677, 10296),
events = c(97796, 195726, 80767, 92890, 166037)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
df_output dataframe:
df_output <- structure(list(range_GroupA = c("Apr2002", "Apr2002", "Apr2002",
"May2002", "May2002", "May2002"), range_GroupB = c("Apr2001",
"Oct2001", "OctToNov2001", "Apr2001", "Oct2001", "OctToNov2001"
), samples_GroupA = c(4951, 4951, 4951, 9332, 9332, 9332), events_GroupA = c(97796,
97796, 97796, 195726, 195726, 195726), samples_GroupB = c(4781,
5677, 10296, 4781, 5677, 10296), events_GroupB = c(80767, 92890,
166037, 80767, 92890, 166037)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
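I think we can get your result with a filtered cartesian join:
library(dplyr)

# Pair every row with every row via a constant dummy key (cartesian join),
# then keep only the pairs whose left category sorts before the right one.
left_join(
  df_input %>% mutate(dummy = 1),
  df_input %>% mutate(dummy = 1),
  by = "dummy") %>%
  filter(category.x < category.y)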
You'll recognize all the numbers you're looking for, but with different header names. We could rename them manually, but that's no fun. See below for a renamed version.
# A tibble: 6 × 9
category.x range.x samples.x events.x dummy category.y range.y samples.y events.y
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 GroupA Apr2002 4951 97796 1 GroupB Apr2001 4781 80767
2 GroupA Apr2002 4951 97796 1 GroupB Oct2001 5677 92890
3 GroupA Apr2002 4951 97796 1 GroupB OctToNov2001 10296 166037
4 GroupA May2002 9332 195726 1 GroupB Apr2001 4781 80767
5 GroupA May2002 9332 195726 1 GroupB Oct2001 5677 92890
6 GroupA May2002 9332 195726 1 GroupB OctToNov2001 10296 166037
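To see why the filter on the two category columns leaves exactly six rows, here is a minimal base R sketch of the same comparison. The names `categories` and `keep` are illustrative only and do not appear in the original answer. The join pairs all 5 rows with all 5 rows, and the string comparison keeps only the pairs where the left side is GroupA and the right side is GroupB.

```r
# The category column from df_input, copied here so the sketch is self-contained.
categories <- c("GroupA", "GroupA", "GroupB", "GroupB", "GroupB")

# One cell per (left row, right row) pair: TRUE when the left category
# sorts strictly before the right one, i.e. a GroupA row paired with a GroupB row.
keep <- outer(categories, categories, FUN = "<")

length(keep)  # 25 candidate pairs from the cartesian join
sum(keep)     # 6 pairs survive: 2 GroupA rows x 3 GroupB rows
```

Same-category pairs and the reversed GroupB/GroupA pairs all compare as FALSE, which is why they are dropped.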
EDIT: This seems to do it with the renaming:
left_join(
  df_input %>% rename_with(~paste0(., "_GroupA")) %>% mutate(dummy = 1),
  df_input %>% rename_with(~paste0(., "_GroupB")) %>% mutate(dummy = 1),
  by = "dummy") %>%
  filter(category_GroupA < category_GroupB) %>%
  select(-category_GroupA, -dummy, -category_GroupB) %>%
  relocate(range_GroupB, .after = 1)
# A tibble: 6 × 6
range_GroupA range_GroupB samples_GroupA events_GroupA samples_GroupB events_GroupB
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Apr2002 Apr2001 4951 97796 4781 80767
2 Apr2002 Oct2001 4951 97796 5677 92890
3 Apr2002 OctToNov2001 4951 97796 10296 166037
4 May2002 Apr2001 9332 195726 4781 80767
5 May2002 Oct2001 9332 195726 5677 92890
6 May2002 OctToNov2001 9332 195726 10296 166037
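As a possible alternative, the same table can be sketched without the dummy key. This assumes dplyr 1.1.0 or later, where `cross_join()` is available, and that the data contains exactly the two categories GroupA and GroupB. The names `group_a`, `group_b` and `df_cross` are illustrative and not part of the original answer.

```r
library(dplyr)  # cross_join() assumes dplyr >= 1.1.0

df_input <- data.frame(
  category = c("GroupA", "GroupA", "GroupB", "GroupB", "GroupB"),
  range    = c("Apr2002", "May2002", "Apr2001", "Oct2001", "OctToNov2001"),
  samples  = c(4951, 9332, 4781, 5677, 10296),
  events   = c(97796, 195726, 80767, 92890, 166037)
)

# Split by category up front so no filter is needed after the join.
group_a <- df_input %>% filter(category == "GroupA") %>% select(-category)
group_b <- df_input %>% filter(category == "GroupB") %>% select(-category)

# Every remaining column name clashes between the two inputs,
# so each one picks up the suffix of the side it came from.
df_cross <- cross_join(group_a, group_b, suffix = c("_GroupA", "_GroupB")) %>%
  relocate(range_GroupB, .after = range_GroupA)

df_cross
```

Splitting first trades the string comparison for two explicit filters, which may be easier to read but would need extending by hand if more categories were added.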