Making a wider dataframe using factor columns
OK, so this one is kind of long. I have a couple of huge dataframes that I'm trying to make wider and eventually merge. I want to merge and group by year and county.
I have a couple of factor columns that I'm trying to spread. Essentially, I want to take the factor levels x, y, z and make them columns x, y, and z. I have an example below. Additionally, I have a few numeric columns that I would like to sum by group.
I've tried to provide an example and some reproducible code to work with. Hopefully that's enough, but please let me know if there's anything I can do to make things easier or clearer, and thanks so much for the help!
YR<-as.factor( c(2019,2018,2019,2019,2018,2018,2019,2019,2018))
STATE<-as.factor( c("CA","MA","KY","KY","CA","MA","KY","KY","CA"))
COUNTY<-as.factor( c("C1","M1","K1","K2","C1","M2","K1","K2","C1"))
CANCER<-as.factor(c("Cervical","Lung","Prostate","Breast","Cervical","Breast","Prostate","Prostate","Lung"))
rand_fact<-as.factor(c("rf1","rf2","rf3","fr4","fr5","rf2","rf3","fr4","fr5"))
rand_num<-as.numeric(c(4,3,5,7,3,5,3,24,9))
rand_chr<-as.character(c("a","d","r","e","g","y","r","e","k"))
TEST_DR<-data.frame(YR,STATE,COUNTY,CANCER,rand_fact,rand_num,rand_chr)
rm(YR,STATE,COUNTY,CANCER,rand_chr,rand_num,rand_fact)
> print(TEST_DR)
YR STATE COUNTY CANCER rand_fact rand_num rand_chr
1 2018 CA C1 Cervical fr5 3 g
2 2018 CA C1 Lung fr5 9 k
3 2018 MA M1 Lung rf2 3 d
4 2018 MA M2 Breast rf2 5 y
5 2019 CA C1 Cervical rf1 4 a
6 2019 KY K1 Prostate rf3 5 r
7 2019 KY K1 Prostate rf3 3 r
8 2019 KY K2 Breast fr4 7 e
9 2019 KY K2 Prostate fr4 24 e
# Ideally the output will look like below, with rows grouped by YR then COUNTY
TEST_DR<-arrange(.data = TEST_DR,YR,COUNTY)
YR<-as.factor( c(2018,2018,2018,2019,2019,2019))
STATE<-as.factor( c("CA","MA","MA","CA","KY","KY"))
COUNTY<-as.factor( c("C1","M1","M2","C1","K1","K2"))
Cervical<-as.numeric(c(1,0,0,1,0,0))
Lung <-as.numeric(c(1,1,0,0,0,0))
Prostate<-as.numeric(c(0,0,0,0,2,1))
Breast<-as.numeric(c(0,0,1,0,0,1))
rand_num<-as.numeric(c(12,3,5,4,8,31))
TEST_DR2 <-data.frame(YR,STATE,COUNTY,Cervical,Lung,Prostate,Breast,rand_num)
rm(YR,STATE,COUNTY,Cervical,Lung,Prostate,Breast,rand_num)
> print(TEST_DR2)
YR STATE COUNTY Cervical Lung Prostate Breast rand_num
1 2018 CA C1 1 1 0 0 12
2 2018 MA M1 0 1 0 0 3
3 2018 MA M2 0 0 0 1 5
4 2019 CA C1 1 0 0 0 4
5 2019 KY K1 0 0 2 0 8
6 2019 KY K2 0 0 1 1 31
Here is a way to do it with count() and {tidyr} spread():
YR <- as.factor( c(2019,2018,2019,2019,2018,2018,2019,2019,2018))
STATE <- as.factor( c("CA","MA","KY","KY","CA","MA","KY","KY","CA"))
COUNTY <- as.factor( c("C1","M1","K1","K2","C1","M2","K1","K2","C1"))
CANCER <- as.factor(c("Cervical","Lung","Prostate","Breast","Cervical","Breast","Prostate","Prostate","Lung"))
rand_fact <- as.factor(c("rf1","rf2","rf3","fr4","fr5","rf2","rf3","fr4","fr5"))
rand_num <- as.numeric(c(4,3,5,7,3,5,3,24,9))
rand_chr <- as.character(c("a","d","r","e","g","y","r","e","k"))
TEST_DR <- data.frame(YR, STATE, COUNTY, CANCER, rand_fact, rand_num, rand_chr)
rm(YR,STATE,COUNTY,CANCER,rand_chr,rand_num,rand_fact)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
TEST_DR %>%
  group_by(YR, STATE, COUNTY) %>%
  count(CANCER, rand_num = sum(rand_num)) %>%
  spread(CANCER, n, fill = 0)
#> # A tibble: 6 x 8
#> # Groups: YR, STATE, COUNTY [6]
#> YR STATE COUNTY rand_num Breast Cervical Lung Prostate
#> <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2018 CA C1 12 0 1 1 0
#> 2 2018 MA M1 3 0 0 1 0
#> 3 2018 MA M2 5 1 0 0 0
#> 4 2019 CA C1 4 0 1 0 0
#> 5 2019 KY K1 8 0 0 0 2
#> 6 2019 KY K2 31 1 0 0 1
Created on 2020-12-02 by the reprex package (v0.3.0)
And for the most up-to-date {tidyverse} syntactic sugar:
TEST_DR %>%
  group_by(YR, STATE, COUNTY) %>%
  count(CANCER, rand_num = sum(rand_num)) %>%
  pivot_wider(names_from = CANCER, values_from = n, values_fill = 0)
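The pipeline above folds the sum into count(), which works because the named expression becomes an extra grouping column. If that feels too implicit, the same result can be sketched as two separate aggregations joined on the keys. This is only an illustration, not the answerer's code: the intermediate names sums, counts and wide are made up here, and it assumes dplyr >= 1.0.0 (for .groups) and tidyr >= 1.1.0 (for a scalar values_fill).

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

# Same example data as in the question (only the columns used here)
TEST_DR <- data.frame(
  YR       = factor(c(2019, 2018, 2019, 2019, 2018, 2018, 2019, 2019, 2018)),
  STATE    = factor(c("CA", "MA", "KY", "KY", "CA", "MA", "KY", "KY", "CA")),
  COUNTY   = factor(c("C1", "M1", "K1", "K2", "C1", "M2", "K1", "K2", "C1")),
  CANCER   = factor(c("Cervical", "Lung", "Prostate", "Breast", "Cervical",
                      "Breast", "Prostate", "Prostate", "Lung")),
  rand_num = c(4, 3, 5, 7, 3, 5, 3, 24, 9)
)

# Step 1: one row per YR/STATE/COUNTY with the numeric column summed
sums <- TEST_DR %>%
  group_by(YR, STATE, COUNTY) %>%
  summarise(rand_num = sum(rand_num), .groups = "drop")

# Step 2: count rows per group and cancer type, then widen the counts
counts <- TEST_DR %>%
  count(YR, STATE, COUNTY, CANCER) %>%
  pivot_wider(names_from = CANCER, values_from = n, values_fill = 0)

# Step 3: join the two summaries back together on the grouping keys
wide <- left_join(sums, counts, by = c("YR", "STATE", "COUNTY"))
print(wide)
```

Keeping the steps apart makes it easier to add more summed columns in step 1, or to widen a second factor column in its own step, before the final join.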
With the exception of having to aggregate the rand_num column, you can almost directly just use dcast on this. Here's how I would approach it:
library(data.table)
# Create a vector of keys that we can use for grouping and for
# identifying the columns for the left-hand-side of the dcast formula
keys <- c("YR", "STATE", "COUNTY")
# * dcast from data.table expects a data.table, so use either setDT or
# as.data.table to convert your data.frame to a data.table
# * .N creates a count by the grouping variables. CANCER has been
# added since we want to count the number of instances. It will
# create a new column named "N" in the data
as.data.table(TEST_DR)[, list(rand_num, .N), c(keys, "CANCER")][
# Sum the rand_num variable by the grouping variable
, rand_num := sum(rand_num), keys][
# Go from long to wide using dcast.
# * ... on the left-hand-side of the formula says to use all
# of the unspecified variables
# * ~ CANCER says that the values from the CANCER column should
# become the new column names
# * value.var = "N" says to fill in the combination of LHS and
# RHS with values from the N column
, dcast(.SD, ... ~ CANCER, value.var = "N")]
# YR STATE COUNTY rand_num Breast Cervical Lung Prostate
# 1: 2018 CA C1 12 0 1 1 0
# 2: 2018 MA M1 3 0 0 1 0
# 3: 2018 MA M2 5 1 0 0 0
# 4: 2019 CA C1 4 0 1 0 0
# 5: 2019 KY K1 8 0 0 0 2
# 6: 2019 KY K2 31 1 0 0 1
"data.table" commands can be chained (similar to using pipes) by passing the result of one operation to the next operation. For example: as.data.table(df)[, do_something][, do_something_else][, do_even_more]
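As a concrete instance of that chaining pattern, here is a minimal sketch on the question's data: the first bracket sums rand_num by the grouping keys, and the second bracket sorts the result. The name res is only illustrative.

```r
library(data.table)

# Same example data as in the question (only the columns used here)
TEST_DR <- data.frame(
  YR       = factor(c(2019, 2018, 2019, 2019, 2018, 2018, 2019, 2019, 2018)),
  STATE    = factor(c("CA", "MA", "KY", "KY", "CA", "MA", "KY", "KY", "CA")),
  COUNTY   = factor(c("C1", "M1", "K1", "K2", "C1", "M2", "K1", "K2", "C1")),
  rand_num = c(4, 3, 5, 7, 3, 5, 3, 24, 9)
)

keys <- c("YR", "STATE", "COUNTY")

# First [ ]: aggregate by the keys; second [ ]: order the aggregated rows
res <- as.data.table(TEST_DR)[
  , .(rand_num = sum(rand_num)), by = keys][
  order(YR, COUNTY)]
print(res)
```

Each bracket receives the data.table returned by the previous one, so no intermediate variables are needed.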