Wide to long data table transformation with variables in columns and rows
I have a csv containing multiple tables, with variables stored in both rows and columns.
About this csv:
> df3
V1 V2 V3 V4 V5 V6 V7 V8
1 nyc 123 main st month 1 2 3 4 5
2 nyc 123 main st x 58568 567567 567909 35876 56943
3 nyc 123 main st y 5345 3673 3453 3467 788
4 nyc 123 main st z 53223 563894 564456 32409 56155
5
6 la 63 main st month 1 2 3 4 5
7 la 63 main st a 87035 7467456 3363 863 43673
8 la 63 main st b 345 456 345 678 345
9 la 63 main st c 86690 7467000 3018 185 43328
10
11 sf 953 main st month 1 2 3 4 5
12 sf 953 main st x 457456 3455 345345 56457 3634
13 sf 953 main st b 5345 3673 3453 3467 788
14 sf 953 main st z 452111 -218 341892 52990 2846
> df4
18 city address month x y z a b c
19 nyc 123 main st 1 58568 5345 53223 null null null
20 nyc 123 main st 2 567567 3673 563894 null null null
21 nyc 123 main st 3 567909 3453 564456 null null null
22 nyc 123 main st 4 35876 3467 32409 null null null
23 nyc 123 main st 5 56943 788 56155 null null null
24 la 63 main st 1 null null null 87035 345 86690
25 la 63 main st 2 null null null 7467456 456 7467000
26 la 63 main st 3 null null null 3363 345 3018
27 la 63 main st 4 null null null 863 678 185
28 la 63 main st 5 null null null 43673 345 43328
29 sf 953 main st 1 457456 null 452111 null 5345 null
30 sf 953 main st 2 3455 null -218 null 3673 null
31 sf 953 main st 3 345345 null 341892 null 3453 null
32 sf 953 main st 4 56457 null 52990 null 3467 null
33 sf 953 main st 5 3634 null 2846 null 788 null
The top is the data I have; the bottom is the transformation I want.
I'm most comfortable in R but I'm practicing Python, so any approach works.
It would help if your df had proper column names to begin with; please assign column names once you have read in the data.
I have used the dplyr, tidyr and stringr libraries for this analysis (gather() and spread() come from tidyr) and also renamed the first 3 columns:
library(dplyr)    # %>%, mutate(), select()
library(tidyr)    # gather(), spread()
library(stringr)  # str_sub()

df <- data.frame(stringsAsFactors=FALSE,
  city = c("nyc", "nyc", "nyc"),
  address = c("123 main st", "123 main st", "123 main st"),
  month = c("x", "y", "z"),
  X1 = c(58568L, 5345L, 53223L),
  X2 = c(567567L, 3673L, 563894L),
  X3 = c(567909L, 3453L, 564456L),
  X4 = c(35876L, 3467L, 32409L),
  X5 = c(56943L, 788L, 56155L)
)
# Stack the X1..X5 columns into (Type, Value) pairs, then spread x/y/z into columns
df %>% gather(Type, Value, -c(city:month)) %>%
  spread(month, Value) %>%
  mutate(month = str_sub(Type, 2, 2)) %>%  # month number = 2nd character of "X1".."X5"
  select(-Type) %>%
  select(c(city, address, month, x:z))
city address month x y z
1 nyc 123 main st 1 58568 5345 53223
2 nyc 123 main st 2 567567 3673 563894
3 nyc 123 main st 3 567909 3453 564456
4 nyc 123 main st 4 35876 3467 32409
5 nyc 123 main st 5 56943 788 56155
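Since the question says a Python approach is welcome too, here is a minimal sketch of the same single-block reshape in plain Python. It is not a translation of the dplyr pipeline, only an illustration of the idea: each variable row becomes a column and each X column becomes a row. The names wide, month_cols and long_rows are made up for this example.

```python
# One block of the wide layout: each key is a variable, each list holds months 1..5.
# The values are copied from the nyc block in the question.
wide = {
    "x": [58568, 567567, 567909, 35876, 56943],
    "y": [5345, 3673, 3453, 3467, 788],
    "z": [53223, 563894, 564456, 32409, 56155],
}
month_cols = ["X1", "X2", "X3", "X4", "X5"]  # same column names as in the R example

long_rows = []
for i, col in enumerate(month_cols):
    # Everything after the leading "X" is the month number
    row = {"city": "nyc", "address": "123 main st", "month": int(col[1:])}
    for name, values in wide.items():
        row[name] = values[i]
    long_rows.append(row)

for row in long_rows:
    print(row)
```

Taking everything after the "X" also covers two-digit months. str_sub(Type, 2, 2) keeps a single character, so a column such as X10 would need a different extraction.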
The sample data set provided by the OP suggests that, for all data frames within the csv file, the monthly columns V4 to V8 refer to the same months 1 to 5 across all "sub frames". If this is true then we can treat the whole csv file as one data frame and convert it to the desired format by reshaping using melt() and dcast() from the data.table package:
library(data.table)
setDT(df3)[, melt(.SD, id.vars = paste0("V", 1:3), na.rm = TRUE)][
V3 != "month", dcast(.SD, V1 + V2 + rleid(variable) ~ forcats::fct_inorder(V3))][
, setnames(.SD, 1:3, c("city", "address", "month"))]
city address month x y z a b c
1: la 63 main st 1 NA NA NA 87035 345 86690
2: la 63 main st 2 NA NA NA 7467456 456 7467000
3: la 63 main st 3 NA NA NA 3363 345 3018
4: la 63 main st 4 NA NA NA 863 678 185
5: la 63 main st 5 NA NA NA 43673 345 43328
6: nyc 123 main st 1 58568 5345 53223 NA NA NA
7: nyc 123 main st 2 567567 3673 563894 NA NA NA
8: nyc 123 main st 3 567909 3453 564456 NA NA NA
9: nyc 123 main st 4 35876 3467 32409 NA NA NA
10: nyc 123 main st 5 56943 788 56155 NA NA NA
11: sf 953 main st 1 457456 NA 452111 NA 5345 NA
12: sf 953 main st 2 3455 NA -218 NA 3673 NA
13: sf 953 main st 3 345345 NA 341892 NA 3453 NA
14: sf 953 main st 4 56457 NA 52990 NA 3467 NA
15: sf 953 main st 5 3634 NA 2846 NA 788 NA
The fct_inorder() function from Hadley's forcats package is used here to order the columns by their first appearance instead of alphabetical order a, b, c, x, y, z.
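For anyone following along in Python, the difference between the two orderings can be illustrated with a short sketch. This only demonstrates the ordering idea and is not what forcats does internally; the list names is typed in by hand from column V3 of the sample data.

```python
# Variable names in the order they occur in column V3 of df3 ("month" rows left out)
names = ["x", "y", "z", "a", "b", "c", "x", "b", "z"]

# dict keys keep insertion order (Python 3.7+), so this deduplicates by first appearance
by_first_appearance = list(dict.fromkeys(names))

# Sorting the unique names gives the alphabetical order instead
alphabetical = sorted(set(names))

print(by_first_appearance)
print(alphabetical)
```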
Note that the cities have also been ordered alphabetically. If this is crucial (but I doubt it is) the original order can be preserved as well by using
forcats::fct_inorder(V1) + V2 + rleid(variable) ~ forcats::fct_inorder(V3)
as the dcast() formula.
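Because the question also allows Python, the whole-file reshape can be sketched with the standard library alone. This is a rough sketch under assumptions, not a port of the data.table code: it assumes the file is comma-separated with eight fields per row and blank lines between sub-tables, and RAW and reshape() are names invented for the example. Unlike the approach above, it reads the month labels from each block's own "month" row, and it keeps cities and variables in order of first appearance.

```python
import csv
import io

# Hypothetical comma-separated rendering of df3; the real file's delimiter is an assumption.
RAW = """\
nyc,123 main st,month,1,2,3,4,5
nyc,123 main st,x,58568,567567,567909,35876,56943
nyc,123 main st,y,5345,3673,3453,3467,788
nyc,123 main st,z,53223,563894,564456,32409,56155

la,63 main st,month,1,2,3,4,5
la,63 main st,a,87035,7467456,3363,863,43673
la,63 main st,b,345,456,345,678,345
la,63 main st,c,86690,7467000,3018,185,43328

sf,953 main st,month,1,2,3,4,5
sf,953 main st,x,457456,3455,345345,56457,3634
sf,953 main st,b,5345,3673,3453,3467,788
sf,953 main st,z,452111,-218,341892,52990,2846
"""


def reshape(rows):
    """Turn the stacked wide blocks into one list of long-format records."""
    records = {}    # (city, address, month) -> {variable: value}, in insertion order
    variables = []  # variable names in order of first appearance
    months = None   # month labels taken from the most recent "month" row
    for row in rows:
        if not any(cell.strip() for cell in row):
            continue  # blank separator line between sub-tables
        city, address, name, *values = (cell.strip() for cell in row)
        if name == "month":
            months = [int(v) for v in values]
            continue
        if months is None:
            raise ValueError("data row found before any 'month' row")
        if name not in variables:
            variables.append(name)
        for month, value in zip(months, values):
            records.setdefault((city, address, month), {})[name] = int(value)
    # Variables that a block does not have become None (the "null" of the desired output)
    return [
        {"city": c, "address": a, "month": m, **{v: vals.get(v) for v in variables}}
        for (c, a, m), vals in records.items()
    ]


table = reshape(csv.reader(io.StringIO(RAW)))
print(list(table[0].keys()))
for record in table:
    print(list(record.values()))
```

For a real file, the io.StringIO(RAW) part would be replaced by an open file handle passed to csv.reader.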
Unfortunately, the OP didn't supply the result of dput(df3), which made it unnecessarily difficult to reproduce the data set as printed in the question:
df3 <- readr::read_table(
" V1 V2 V3 V4 V5 V6 V7 V8
1 nyc 123 main st month 1 2 3 4 5
2 nyc 123 main st x 58568 567567 567909 35876 56943
3 nyc 123 main st y 5345 3673 3453 3467 788
4 nyc 123 main st z 53223 563894 564456 32409 56155
5
6 la 63 main st month 1 2 3 4 5
7 la 63 main st a 87035 7467456 3363 863 43673
8 la 63 main st b 345 456 345 678 345
9 la 63 main st c 86690 7467000 3018 185 43328
10
11 sf 953 main st month 1 2 3 4 5
12 sf 953 main st x 457456 3455 345345 56457 3634
13 sf 953 main st b 5345 3673 3453 3467 788
14 sf 953 main st z 452111 -218 341892 52990 2846"
)
library(data.table)
setDT(df3)[, V2 := paste(X3, V2)][, c("X1", "X3") := NULL]
setDF(df3)[]
V1 V2 V3 V4 V5 V6 V7 V8
1 nyc 123 main st month 1 2 3 4 5
2 nyc 123 main st x 58568 567567 567909 35876 56943
3 nyc 123 main st y 5345 3673 3453 3467 788
4 nyc 123 main st z 53223 563894 564456 32409 56155
5 NA NA NA NA NA NA
6 la 63 main st month 1 2 3 4 5
7 la 63 main st a 87035 7467456 3363 863 43673
8 la 63 main st b 345 456 345 678 345
9 la 63 main st c 86690 7467000 3018 185 43328
10 NA NA NA NA NA NA
11 sf 953 main st month 1 2 3 4 5
12 sf 953 main st x 457456 3455 345345 56457 3634
13 sf 953 main st b 5345 3673 3453 3467 788
14 sf 953 main st z 452111 -218 341892 52990 2846