Transpose / reshape dataframe without "timevar" from long to wide format
I have a data frame that follows the long pattern below:
Name MedName
Name1 atenolol 25mg
Name1 aspirin 81mg
Name1 sildenafil 100mg
Name2 atenolol 50mg
Name2 enalapril 20mg
I would like to get the result below (I do not care whether the columns end up named this way; I just want the data in this format):
Name medication1 medication2 medication3
Name1 atenolol 25mg aspirin 81mg sildenafil 100mg
Name2 atenolol 50mg enalapril 20mg NA
Through this very site I have become somewhat familiar with the reshape/reshape2 packages, and I have made several attempts to get this to work, but so far they have all failed.
When I try dcast(dataframe, Name ~ MedName, value.var='MedName') I just get a bunch of columns that are flags of the medication names (the transposed values are 1 or 0), for example:
Name atenolol 25mg aspirin 81mg
Name1 1 1
Name2 0 0
I also tried dcast(dataset, Name ~ variable) after melting the dataset, but this just produces the following (it only counts how many meds each person has):
Name MedName
Name1 3
name2 2
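A likely reason for the counts in the second attempt: after melting, each Name has several rows for the single variable column, and reshape2's dcast falls back to length as the aggregation function when a cell would hold more than one value. The base R sketch below reproduces those counts without reshape2 and shows that a per-group counter leaves exactly one value per cell. The column name counter is an illustrative assumption, not something from the original post.

```r
# Example data from the question
dataset <- data.frame(
  Name    = c("Name1", "Name1", "Name1", "Name2", "Name2"),
  MedName = c("atenolol 25mg", "aspirin 81mg", "sildenafil 100mg",
              "atenolol 50mg", "enalapril 20mg"),
  stringsAsFactors = FALSE
)

# How many rows would land in each Name cell: the same counts the cast returned
rows_per_name <- tapply(dataset$MedName, dataset$Name, length)
print(rows_per_name)

# A counter that restarts at 1 within each Name (hypothetical column name)
dataset$counter <- ave(seq_along(dataset$Name), dataset$Name, FUN = seq_along)

# Every (Name, counter) pair is now unique, so no aggregation is needed
no_duplicates <- anyDuplicated(dataset[c("Name", "counter")]) == 0
print(no_duplicates)
```

With such a counter in place, casting on Name ~ counter has one medication per cell, which is the idea the answers below build on.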
Finally, I tried to melt the data and then reshape using idvar="Name" and timevar="variable" (whose values are all just MedName). However, this does not seem built for my problem: when there are multiple matches for the idvar, reshape takes only the first MedName and ignores the rest.
Does anyone know how to do this using reshape or another R function? I realize there is probably a messier way to do this with some for loops and conditionals that split and re-paste the data, but I was hoping for a simpler solution. Thank you so much!
With the data.table package, this can easily be solved with the new rowid function:
library(data.table)
dcast(setDT(d1),
Name ~ rowid(Name, prefix = "medication"),
value.var = "MedName")
which gives:
   Name   medication1    medication2      medication3
1 Name1 atenolol 25mg   aspirin 81mg sildenafil 100mg
2 Name2 atenolol 50mg enalapril 20mg             <NA>
Another method (commonly used before version 1.9.7):
dcast(setDT(d1)[, rn := 1:.N, by = Name],
Name ~ paste0("medication",rn),
value.var = "MedName")
giving the same result.
A similar approach, but now using the dplyr and tidyr packages:
library(dplyr)
library(tidyr)
d1 %>%
group_by(Name) %>%
mutate(rn = paste0("medication",row_number())) %>%
spread(rn, MedName)
which gives:
Source: local data frame [2 x 4]
Groups: Name [2]

    Name   medication1    medication2      medication3
  (fctr)         (chr)          (chr)            (chr)
1  Name1 atenolol 25mg   aspirin 81mg sildenafil 100mg
2  Name2 atenolol 50mg enalapril 20mg               NA
Assuming your data is in the object dataset:
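library(plyr)
library(reshape2)  # provides dcast()
## Add a medication index that restarts at 1 within each Name
data_with_index <- ddply(dataset, .(Name), mutate,
                         index = paste0('medication', seq_along(Name)))
## Cast to wide format, one column per index value
dcast(data_with_index, Name ~ index, value.var = 'MedName')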
## Name medication1 medication2 medication3
## 1 Name1 atenolol 25mg aspirin 81mg sildenafil 100mg
## 2 Name2 atenolol 50mg enalapril 20mg <NA>
You could always generate a unique timevar before using reshape. Here I use ave to apply the function seq_along 'along' each "Name".
test <- data.frame(
Name=c(rep("name1",3),rep("name2",2)),
MedName=c("atenolol 25mg","aspirin 81mg","sildenafil 100mg",
"atenolol 50mg","enalapril 20mg")
)
# generate the 'timevar'
test$uniqid <- with(test, ave(as.character(Name), Name, FUN = seq_along))
# reshape!
reshape(test, idvar = "Name", timevar = "uniqid", direction = "wide")
Result:
Name MedName.1 MedName.2 MedName.3
1 name1 atenolol 25mg aspirin 81mg sildenafil 100mg
4 name2 atenolol 50mg enalapril 20mg <NA>
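To make the intermediate step visible, here is a minimal sketch that repeats the approach above, inspects the generated uniqid column, and then renames the MedName.1, MedName.2, ... columns to the medication1, medication2, ... names from the question. The renaming with sub() and the row-name reset are additions for illustration, not part of the original answer.

```r
test <- data.frame(
  Name    = c(rep("name1", 3), rep("name2", 2)),
  MedName = c("atenolol 25mg", "aspirin 81mg", "sildenafil 100mg",
              "atenolol 50mg", "enalapril 20mg"),
  stringsAsFactors = FALSE
)

# Counter within each Name; it comes back as character because the
# first argument to ave() is a character vector
test$uniqid <- with(test, ave(as.character(Name), Name, FUN = seq_along))
print(test$uniqid)

# Long to wide, one MedName.<uniqid> column per counter value
wide <- reshape(test, idvar = "Name", timevar = "uniqid", direction = "wide")

# Optional clean-up (an assumption about the desired names)
names(wide) <- sub("^MedName\\.", "medication", names(wide))
rownames(wide) <- NULL
print(wide)
```

Groups with fewer medications than the longest group are padded with NA, as in the result shown above.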
This actually seems to be a fairly common problem, so I have included a function called getanID in my "splitstackshape" package.
Here's what it does:
library(splitstackshape)
getanID(test, "Name")
# Name MedName .id
# 1: name1 atenolol 25mg 1
# 2: name1 aspirin 81mg 2
# 3: name1 sildenafil 100mg 3
# 4: name2 atenolol 50mg 1
# 5: name2 enalapril 20mg 2
Since "data.table" is loaded along with "splitstackshape", you have access to dcast.data.table, so you can proceed as with @mnel's example.
dcast.data.table(getanID(test, "Name"), Name ~ .id, value.var = "MedName")
# Name 1 2 3
# 1: name1 atenolol 25mg aspirin 81mg sildenafil 100mg
# 2: name2 atenolol 50mg enalapril 20mg NA
The function essentially applies sequence(.N) within the groups you identify, which creates the "time" column.
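For readers who have not met sequence() before, the sketch below shows the same idea in base R. In data.table, .N is the number of rows in the current group; here the group sizes are taken from rle() instead, which assumes that the rows of each group are stored next to each other. This is an illustration of the idea, not the package's actual source.

```r
name <- c("name1", "name1", "name1", "name2", "name2")

# Size of each run of identical names (stands in for .N per group)
group_sizes <- rle(name)$lengths
print(group_sizes)

# sequence() expands each size n into 1..n and concatenates the pieces
id <- sequence(group_sizes)
print(id)
```

If the rows are not grouped contiguously, sort by the id column first or compute the counter with ave() instead.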
@thelatemail's solution is similar to this one. When I generate the time variable, I use rle in case I'm not working interactively and the Name variable needs to be dynamic.
# start with your example data
x <-
data.frame(
Name=c(rep("name1",3),rep("name2",2)),
MedName=c("atenolol 25mg","aspirin 81mg","sildenafil 100mg",
"atenolol 50mg","enalapril 20mg")
)
# pick the id variable
id <- 'Name'
# sort the data.frame by that variable
x <- x[ order( x[ , id ] ) , ]
# construct a `time` variable on the fly
x$time <- unlist( lapply( rle( as.character( x[ , id ] ) )$lengths , seq_len ) )
# `reshape` uses that new `time` column by default
y <- reshape( x , idvar = id , direction = 'wide' )
# done
y
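Because the point of this variant is that the id column can be chosen at run time, the steps can be wrapped in a small function. The sketch below is one way to do that; the function name widen_by_id is hypothetical, and the row-name reset at the end is an addition that the original steps do not perform.

```r
# Hypothetical helper: widen `data` using the column named by `id`
widen_by_id <- function(data, id) {
  # sort so that each group's rows are contiguous, as rle() requires
  data <- data[order(data[[id]]), , drop = FALSE]
  # counter that restarts at 1 for every run of identical id values
  data$time <- unlist(lapply(rle(as.character(data[[id]]))$lengths, seq_len))
  # reshape() picks up the `time` column by default
  wide <- reshape(data, idvar = id, direction = "wide")
  rownames(wide) <- NULL
  wide
}

x <- data.frame(
  Name    = c(rep("name1", 3), rep("name2", 2)),
  MedName = c("atenolol 25mg", "aspirin 81mg", "sildenafil 100mg",
              "atenolol 50mg", "enalapril 20mg"),
  stringsAsFactors = FALSE
)

y <- widen_by_id(x, "Name")
print(y)
```

The helper assumes the data frame does not already contain a column called time, since that name is overwritten.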
One clean solution involves the very useful pivot_wider function from the tidyr package (version 1.1.0). With it you can also specify the column names directly through the names_glue argument.
library(tidyr)
library(dplyr)
dataframe %>%
group_by(Name) %>%
mutate(row_n = row_number()) %>%
pivot_wider(id_cols = Name, names_from = row_n, values_from = MedName, names_glue = "medication{row_n}")
Output
# A tibble: 2 x 4
# Groups: Name [2]
# Name medication1 medication2 medication3
# <chr> <chr> <chr> <chr>
# 1 Name1 atenolol 25mg aspirin 81mg sildenafil 100mg
# 2 Name2 atenolol 50mg enalapril 20mg NA
A tidyr solution with chop() and unnest_wider().
library(tidyr)
df %>%
chop(-Name) %>%
unnest_wider(MedName, names_sep = "")
# # A tibble: 2 x 4
# Name MedName1 MedName2 MedName3
# <chr> <chr> <chr> <chr>
# 1 Name1 atenolol 25mg aspirin 81mg sildenafil 100mg
# 2 Name2 atenolol 50mg enalapril 20mg NA
The argument names_sep = "" is necessary; otherwise, the new column names will be ..1, ..2, and ..3.
Data
df <- structure(list(Name = c("Name1", "Name1", "Name1", "Name2", "Name2"
), MedName = c("atenolol 25mg", "aspirin 81mg", "sildenafil 100mg",
"atenolol 50mg", "enalapril 20mg")), class = "data.frame", row.names = c(NA, -5L))
Here's a shorter way, taking advantage of the way unlist deals with names:
library(dplyr)
df1 %>% group_by(Name) %>% do(as_tibble(t(unlist(.[2]))))
# # A tibble: 2 x 4
# # Groups: Name [2]
# Name MedName1 MedName2 MedName3
# <chr> <chr> <chr> <chr>
# 1 name1 atenolol 25mg aspirin 81mg sildenafil 100mg
# 2 name2 atenolol 50mg enalapril 20mg <NA>
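The naming behaviour this relies on can be seen in isolation with base R alone. The sketch below uses one group's medications as a stand-in for the one-column piece .[2] that do() passes along; the object names are illustrative.

```r
# One group's data, shaped like the one-column list that .[2] yields
meds <- list(MedName = c("atenolol 25mg", "aspirin 81mg", "sildenafil 100mg"))

# unlist() appends a running index to the list element's name
flat <- unlist(meds)
print(names(flat))

# t() turns the named vector into a one-row matrix with those column names
one_row <- t(flat)
print(dim(one_row))
print(colnames(one_row))

# Caveat: a length-one element keeps its bare name, with no index appended
single <- unlist(list(MedName = "atenolol 25mg"))
print(names(single))
```

So a group with only one medication would contribute a column called MedName rather than MedName1, which is worth checking for if such groups can occur in your data.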