Convert multiple rows into single column in R
Using R, I have a large data frame, of which the following is an example:
df = data.frame(X1 = c("02JAN2008","09:30 - 10:00", "10:00 - 10:30", "10:30 - 11:00","11:00 - 11:30", "15:30 - 16:00", "16:00 - 16:30", "03JAN2008", "09:30 - 10:00", "10:00 - 10:30", "10:30 - 11:00", "11:00 - 11:30"),X2 = c(NA, 1469.37, 1459.91, 1456.92, 1453.48, 1447.22, 1447.16,NA, 1449.78, 1451.21, 1450.08, 1452.16),X3 = c(NA, 1467.97, 1467.11, 1459.76, 1457.00, 1444.00, 1447.67,NA, 1447.55, 1450.66, 1452.06, 1450.01))
which looks like:
X1 X2 X3
1 02JAN2008 NA NA
2 09:30 - 10:00 1469.37 1467.97
3 10:00 - 10:30 1459.91 1467.11
4 10:30 - 11:00 1456.92 1459.76
5 11:00 - 11:30 1453.48 1457.00
6 15:30 - 16:00 1447.22 1444.00
7 16:00 - 16:30 1447.16 1447.67
8 03JAN2008 NA NA
9 09:30 - 10:00 1449.78 1447.55
10 10:00 - 10:30 1451.21 1450.66
11 10:30 - 11:00 1450.08 1452.06
12 11:00 - 11:30 1452.16 1450.01
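Because of missing data, some days may have 6 observations while others may have only 4 (or fewer; this is just an example).
I would like to convert this into a data frame in which every 30-minute interval carries its date in a separate column, e.g.: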
X1 X2 X3 X4
1 02JAN2008 09:30 - 10:00 1469.37 1467.97
2 02JAN2008 10:00 - 10:30 1459.91 1467.11
3 02JAN2008 10:30 - 11:00 1456.92 1459.76
4 02JAN2008 11:00 - 11:30 1453.48 1457.00
5 02JAN2008 15:30 - 16:00 1447.22 1444.00
6 02JAN2008 16:00 - 16:30 1447.16 1447.67
7 03JAN2008 09:30 - 10:00 1449.78 1447.55
8 03JAN2008 10:00 - 10:30 1451.21 1450.66
9 03JAN2008 10:30 - 11:00 1450.08 1452.06
10 03JAN2008 11:00 - 11:30 1452.16 1450.01
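I can easily grab the indices of df where X2 is NA and then write a for loop to carry the date forward, but I would like to avoid using a for loop in R.
How can I do this in R? A dplyr or tidyr solution is certainly welcome, but I was unable to work one out from the examples in the documentation. Perhaps some version of melt?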
Here is one option:
library(data.table)
dt = as.data.table(df) # or setDT to convert in place
dt[, grp := cumsum(is.na(X2))][, c(date = list(X1[1]), tail(.SD, -1)), by = grp]
# grp date X1 X2 X3
# 1: 1 02JAN2008 09:30 - 10:00 1469.37 1467.97
# 2: 1 02JAN2008 10:00 - 10:30 1459.91 1467.11
# 3: 1 02JAN2008 10:30 - 11:00 1456.92 1459.76
# 4: 1 02JAN2008 11:00 - 11:30 1453.48 1457.00
# 5: 1 02JAN2008 15:30 - 16:00 1447.22 1444.00
# 6: 1 02JAN2008 16:00 - 16:30 1447.16 1447.67
# 7: 2 03JAN2008 09:30 - 10:00 1449.78 1447.55
# 8: 2 03JAN2008 10:00 - 10:30 1451.21 1450.66
# 9: 2 03JAN2008 10:30 - 11:00 1450.08 1452.06
#10: 2 03JAN2008 11:00 - 11:30 1452.16 1450.01
Here is a dplyr way:
library(dplyr)
# TRUE for the date header rows, which have no observations
breaks <- is.na(df$X2)
df %>%
  mutate(date = X1[breaks][cumsum(breaks)]) %>%
  filter(!breaks)
# X1 X2 X3 date
# 1 09:30 - 10:00 1469 1468 02JAN2008
# 2 10:00 - 10:30 1460 1467 02JAN2008
# 3 10:30 - 11:00 1457 1460 02JAN2008
# 4 11:00 - 11:30 1453 1457 02JAN2008
# 5 15:30 - 16:00 1447 1444 02JAN2008
# 6 16:00 - 16:30 1447 1448 02JAN2008
# 7 09:30 - 10:00 1450 1448 03JAN2008
# 8 10:00 - 10:30 1451 1451 03JAN2008
# 9 10:30 - 11:00 1450 1452 03JAN2008
# 10 11:00 - 11:30 1452 1450 03JAN2008
Or, just as simply, in base R:
df <- within(df, date <- X1[breaks][cumsum(breaks)])
df[! breaks, ]
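Since the question also asks about tidyr, here is a minimal sketch of the same idea using tidyr::fill. It is not taken from any of the answers; it assumes that a date header row is exactly a row where X2 is NA, and it uses a shortened copy of the example data so that it runs on its own.

```r
library(dplyr)
library(tidyr)

# Shortened copy of the example data (assumption: header rows have NA in X2)
df <- data.frame(
  X1 = c("02JAN2008", "09:30 - 10:00", "10:00 - 10:30",
         "03JAN2008", "09:30 - 10:00"),
  X2 = c(NA, 1469.37, 1459.91, NA, 1449.78),
  X3 = c(NA, 1467.97, 1467.11, NA, 1447.55),
  stringsAsFactors = FALSE
)

res <- df %>%
  # keep the date on header rows only, NA everywhere else
  mutate(date = if_else(is.na(X2), X1, NA_character_)) %>%
  # carry each date down over the rows that follow it
  fill(date) %>%
  # drop the header rows themselves
  filter(!is.na(X2)) %>%
  # put the date column first
  select(date, everything())

res
```

If a genuine observation can have a missing X2, this test would mistake it for a date row; in that case a pattern match on X1, such as the grepl(':', ...) check used in another answer, is the safer way to find the headers.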
One approach is na.locf from zoo:
require(zoo)
df0<-cbind(df$X1,df)
df0[!is.na(df0[,3]),1]<-NA
df0[,1]<-na.locf(df0[,1])
df0<-df0[!is.na(df0[,3]),]
This gives:
> df0
df$X1 X1 X2 X3
2 02JAN2008 09:30 - 10:00 1469.37 1467.97
3 02JAN2008 10:00 - 10:30 1459.91 1467.11
4 02JAN2008 10:30 - 11:00 1456.92 1459.76
5 02JAN2008 11:00 - 11:30 1453.48 1457.00
6 02JAN2008 15:30 - 16:00 1447.22 1444.00
7 02JAN2008 16:00 - 16:30 1447.16 1447.67
9 03JAN2008 09:30 - 10:00 1449.78 1447.55
10 03JAN2008 10:00 - 10:30 1451.21 1450.66
11 03JAN2008 10:30 - 11:00 1450.08 1452.06
12 03JAN2008 11:00 - 11:30 1452.16 1450.01
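For readers without zoo, the "last observation carried forward" step that na.locf performs above can be sketched in base R. The helper name locf is hypothetical and this is only an illustration of the idea, not zoo's implementation; unlike na.locf with its default arguments, it keeps leading NAs instead of dropping them.

```r
# Hypothetical helper: replace each NA with the most recent non-NA value.
# Leading NAs stay NA because nothing precedes them.
locf <- function(x) {
  # idx counts how many non-NA values have been seen so far (0 before the first)
  idx <- cumsum(!is.na(x))
  # prepend NA so that idx == 0 maps to NA, then look up by idx + 1
  c(NA, x[!is.na(x)])[idx + 1]
}

dates <- c("02JAN2008", NA, NA, "03JAN2008", NA)
locf(dates)
```

The cumsum index here is the same trick the dplyr and base R answers use with cumsum(breaks).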
A base R option would be
df$X1 <- as.character(df$X1)
indx <- !grepl(':', df$X1)
res <- setNames(
  data.frame(
    unlist(tapply(df$X1[indx][cumsum(indx)], cumsum(indx), FUN = head, -1)),
    df[!indx, ]
  ),
  paste0("X", 1:4)
)
row.names(res) <- NULL
res
# X1 X2 X3 X4
#1 02JAN2008 09:30 - 10:00 1469.37 1467.97
#2 02JAN2008 10:00 - 10:30 1459.91 1467.11
#3 02JAN2008 10:30 - 11:00 1456.92 1459.76
#4 02JAN2008 11:00 - 11:30 1453.48 1457.00
#5 02JAN2008 15:30 - 16:00 1447.22 1444.00
#6 02JAN2008 16:00 - 16:30 1447.16 1447.67
#7 03JAN2008 09:30 - 10:00 1449.78 1447.55
#8 03JAN2008 10:00 - 10:30 1451.21 1450.66
#9 03JAN2008 10:30 - 11:00 1450.08 1452.06
#10 03JAN2008 11:00 - 11:30 1452.16 1450.01
Or
res2 <- do.call(rbind, lapply(
  Map(cbind, df$X1[indx], split(df[!indx, ], cumsum(indx)[!indx])),
  setNames, paste0("X", 1:4)
))
row.names(res2) <- NULL
I tried this:
> na_ind <- which(is.na(df$X2))
> day_break <- c(na_ind, nrow(df) + 1)
> day_count <- day_break[-1] - day_break[-length(day_break)] -1
> day_count
## [1] 6 4
> new_df <- cbind(date = rep(df$X1[na_ind], times = day_count),
+ df[-na_ind,])
> new_df
## date X1 X2 X3
## 2 02JAN2008 09:30 - 10:00 1469.37 1467.97
## 3 02JAN2008 10:00 - 10:30 1459.91 1467.11
## 4 02JAN2008 10:30 - 11:00 1456.92 1459.76
## 5 02JAN2008 11:00 - 11:30 1453.48 1457.00
## 6 02JAN2008 15:30 - 16:00 1447.22 1444.00
## 7 02JAN2008 16:00 - 16:30 1447.16 1447.67
## 9 03JAN2008 09:30 - 10:00 1449.78 1447.55
## 10 03JAN2008 10:00 - 10:30 1451.21 1450.66
## 11 03JAN2008 10:30 - 11:00 1450.08 1452.06
## 12 03JAN2008 11:00 - 11:30 1452.16 1450.01