Create a panel data frame
I would like to create a panel from a dataset that has a single observation per unit, such that every unit ends up with an observation for every time period. Using the following example:
id <- seq(1:4)
year <- c(2005, 2008, 2008, 2007)
y <- c(1,0,0,1)
frame <- data.frame(id, year, y)
frame
  id year y
1  1 2005 1
2  2 2008 0
3  3 2008 0
4  4 2007 1
For each unique ID, I would like there to be exactly one observation for each of the years 2005, 2006, 2007, and 2008 (the lower and upper time periods in this frame and the years between them), with the outcome y set to 0 wherever there is no existing observation, so that the new frame looks like:
   id year y
1   1 2005 1
2   1 2006 0
3   1 2007 0
4   1 2008 0
....
13  4 2005 0
14  4 2006 0
15  4 2007 1
16  4 2008 0
I haven't had much success with loops. Any and all thoughts would be greatly appreciated.
1) reshape2 Create a grid g of all years and id values crossed and rbind it with frame. Then, using the reshape2 package, cast frame from long to wide form and melt it back to long form. Finally, rearrange the rows and columns as desired.
The lines ending in a single # are there only to ensure that every year is present, so if we knew that were already the case those lines could be omitted. The line ending in ## only rearranges the rows and columns, so if that did not matter it could be omitted too.
library(reshape2)
g <- with(frame, expand.grid(year = seq(min(year), max(year)), id = unique(id), y = 0)) #
frame <- rbind(frame, g) #
wide <- dcast(frame, year ~ id, fill = 0, fun.aggregate = sum, value.var = "y")
long <- melt(wide, id.vars = "year", variable.name = "id", value.name = "y")
long <- long[order(long$id, long$year), c("id", "year", "y")] ##
giving:
> long
   id year y
1   1 2005 1
2   1 2006 0
3   1 2007 0
4   1 2008 0
5   2 2005 0
6   2 2006 0
7   2 2007 0
8   2 2008 0
9   3 2005 0
10  3 2006 0
11  3 2007 0
12  3 2008 0
13  4 2005 0
14  4 2006 0
15  4 2007 1
16  4 2008 0
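One detail of this result is that melt stores the former column names in a factor, so id comes back as a factor and not as the original integer. If a plain integer id is wanted, a minimal sketch of the conversion follows. The data frame built here is only a stand-in with the same shape as long, so that the snippet runs without reshape2.

```r
# Stand-in for the `long` result of solution (1): id is a factor,
# as it would be after melt(). The values are illustrative only.
long <- data.frame(id   = factor(rep(1:4, each = 4)),
                   year = rep(2005:2008, times = 4),
                   y    = 0)

# Go through character first: as.integer() applied directly to a factor
# returns the internal level codes, not the labels.
long$id <- as.integer(as.character(long$id))
str(long$id)
```

Here the level codes and the labels happen to coincide, but going through as.character stays correct when the ids are not 1, 2, 3, and so on.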
2) aggregate A shorter solution would be to run just the two lines that end with # above and then follow them with an aggregate call, as shown. This solution uses no add-on packages.
g <- with(frame, expand.grid(year = seq(min(year), max(year)), id = unique(id), y = 0)) #
frame <- rbind(frame, g) #
aggregate(y ~ year + id, frame, sum)[c("id", "year", "y")]
This gives the same answer as solution (1) except that, as noted by a commenter, solution (1) makes id a factor whereas it is not a factor in this solution.
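For repeated use, the base-R steps of solution (2) could be wrapped in a small function. The following is only a sketch: the name fill_panel is hypothetical, and it assumes the columns are called id, year and y as in the example.

```r
# Hypothetical helper (not part of the answer) wrapping solution (2).
# Assumes columns named id, year and y.
fill_panel <- function(frame) {
  # Grid of every id crossed with every year in the observed range, y = 0
  g <- with(frame, expand.grid(year = seq(min(year), max(year)),
                               id = unique(id), y = 0))
  # Stack the grid under the data; summing per (year, id) keeps the
  # observed y and leaves 0 where nothing was observed
  both <- rbind(frame, g)
  aggregate(y ~ year + id, both, sum)[c("id", "year", "y")]
}

frame <- data.frame(id = 1:4, year = c(2005, 2008, 2008, 2007), y = c(1, 0, 0, 1))
panel <- fill_panel(frame)
panel
```

Because the gaps are filled by summing, an input with more than one row for the same id and year would have those y values added together, so this sketch fits data with at most one observation per id and year.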
Using data.table:
require(data.table)
DT <- data.table(frame, key=c("id", "year"))
comb <- CJ(1:4, 2005:2008) # like 'expand.grid', but faster + sets key
ans <- DT[comb][is.na(y), y:=0L] # perform a join (DT[comb]), then set NAs to 0
#     id year y
#  1:  1 2005 1
#  2:  1 2006 0
#  3:  1 2007 0
#  4:  1 2008 0
#  5:  2 2005 0
#  6:  2 2006 0
#  7:  2 2007 0
#  8:  2 2008 0
#  9:  3 2005 0
# 10:  3 2006 0
# 11:  3 2007 0
# 12:  3 2008 0
# 13:  4 2005 0
# 14:  4 2006 0
# 15:  4 2007 1
# 16:  4 2008 0
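The code above hard-codes the ids 1:4 and the years 2005:2008. A variant that takes both ranges from the data might look like the sketch below. It is not the answerer's code: it uses the on= argument in place of a key, and assumes the data.table package is installed.

```r
library(data.table)

frame <- data.frame(id = 1:4, year = c(2005, 2008, 2008, 2007), y = c(1, 0, 0, 1))
DT <- as.data.table(frame)

# All id/year combinations, with the ranges derived from the data.
# as.numeric() keeps year the same type as in DT.
comb <- CJ(id = unique(DT$id),
           year = as.numeric(seq(min(DT$year), max(DT$year))))

# Join on the named columns; combinations with no match in DT get y = NA
ans <- DT[comb, on = c("id", "year")]

# Fill the gaps by reference
ans[is.na(y), y := 0]
ans
```

Naming the arguments in CJ lets the join refer to the columns by name, so no key has to be set beforehand.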
Maybe not an elegant solution, but anyway:
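# use the full year range: unique(year) would skip 2006, which never occurs in the data
df <- expand.grid(id = unique(frame$id), year = seq(min(frame$year), max(frame$year)))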
frame <- frame[frame$y != 0,]
df$y <- 0
df2 <- rbind(frame, df)
df2 <- df2[!duplicated(df2[,c("id", "year")]),]
df2 <- df2[order(df2$id, df2$year),]
rownames(df2) <- NULL
df2
#    id year y
# 1   1 2005 1
# 2   1 2006 0
# 3   1 2007 0
# 4   1 2008 0
# 5   2 2005 0
# 6   2 2006 0
# 7   2 2007 0
# 8   2 2008 0
# 9   3 2005 0
# 10  3 2006 0
# 11  3 2007 0
# 12  3 2008 0
# 13  4 2005 0
# 14  4 2006 0
# 15  4 2007 1
# 16  4 2008 0