
Wide to long data table transformation with variables in columns and rows

I have a CSV that contains multiple tables, with variables stored in both rows and columns.
About this CSV:

  1. I want to go from "wide" to "long"
  2. There are multiple "data frames" in one CSV
  3. Each "data frame" has different types of variables

> df3
     V1          V2    V3     V4      V5     V6      V7    V8
1   nyc 123 main st month      1       2      3       4     5
2   nyc 123 main st     x  58568  567567 567909   35876 56943
3   nyc 123 main st     y   5345    3673   3453    3467   788
4   nyc 123 main st     z  53223  563894 564456   32409 56155
5                                                            
6    la  63 main st month      1       2      3       4     5
7    la  63 main st     a  87035 7467456   3363     863 43673
8    la  63 main st     b    345     456    345     678   345
9    la  63 main st     c  86690 7467000   3018     185 43328
10                                                           
11   sf 953 main st month      1       2      3       4     5
12   sf 953 main st     x 457456    3455 345345   56457  3634
13   sf 953 main st     b   5345    3673   3453    3467   788
14   sf 953 main st     z 452111    -218 341892   52990  2846

> df4
18 city     address month      x       y      z       a     b       c
19  nyc 123 main st     1  58568    5345  53223    null  null    null
20  nyc 123 main st     2 567567    3673 563894    null  null    null
21  nyc 123 main st     3 567909    3453 564456    null  null    null
22  nyc 123 main st     4  35876    3467  32409    null  null    null
23  nyc 123 main st     5  56943     788  56155    null  null    null
24   la  63 main st     1   null    null   null   87035   345   86690
25   la  63 main st     2   null    null   null 7467456   456 7467000
26   la  63 main st     3   null    null   null    3363   345    3018
27   la  63 main st     4   null    null   null     863   678     185
28   la  63 main st     5   null    null   null   43673   345   43328
29   sf 953 main st     1 457456    null 452111    null  5345    null
30   sf 953 main st     2   3455    null   -218    null  3673    null
31   sf 953 main st     3 345345    null 341892    null  3453    null
32   sf 953 main st     4  56457    null  52990    null  3467    null
33   sf 953 main st     5   3634    null   2846    null   788    null

The top is the data I have; the bottom is the transformation I want.

I'm most comfortable in R but I'm practicing Python, so any approach works.

It would help if your df had proper column names first; please set the column names once you have read in the data.

I used the dplyr, tidyr, and stringr libraries for this analysis and renamed the first 3 columns:

library(dplyr)    # %>%, mutate(), select()
library(tidyr)    # gather(), spread()
library(stringr)  # str_sub()

df <- data.frame(stringsAsFactors=FALSE,
        city = c("nyc", "nyc", "nyc"),
     address = c("123 main st", "123 main st", "123 main st"),
       month = c("x", "y", "z"),
          X1 = c(58568L, 5345L, 53223L),
          X2 = c(567567L, 3673L, 563894L),
          X3 = c(567909L, 3453L, 564456L),
          X4 = c(35876L, 3467L, 32409L),
          X5 = c(56943L, 788L, 56155L)
)

df %>% gather(Type, Value, -c(city:month)) %>% 
        spread(month, Value) %>%
        mutate(month = str_sub(Type, 2, 2)) %>%
        select(-Type) %>%
        select(c(city, address, month, x:z))

  city     address month      x    y      z
1  nyc 123 main st     1  58568 5345  53223
2  nyc 123 main st     2 567567 3673 563894
3  nyc 123 main st     3 567909 3453 564456
4  nyc 123 main st     4  35876 3467  32409
5  nyc 123 main st     5  56943  788  56155
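
For readers following along in Python, the same gather-then-spread idea can be sketched with the standard library alone. This is a minimal illustration under assumptions, not a translation of the dplyr code: `block` and `reshape_block` are names made up here, and the values are copied from the nyc rows in the question.

```python
# Hypothetical in-memory copy of the nyc block from the question,
# including its "month" header row.
block = [
    ["nyc", "123 main st", "month", "1", "2", "3", "4", "5"],
    ["nyc", "123 main st", "x", "58568", "567567", "567909", "35876", "56943"],
    ["nyc", "123 main st", "y", "5345", "3673", "3453", "3467", "788"],
    ["nyc", "123 main st", "z", "53223", "563894", "564456", "32409", "56155"],
]


def reshape_block(rows):
    """Turn one block (a 'month' row plus variable rows) into one dict per month."""
    header, *variables = rows
    city, address = header[0], header[1]
    months = [int(m) for m in header[3:]]
    names = [row[2] for row in variables]
    # zip(*...) transposes the value matrix: one (x, y, z) tuple per month.
    columns = zip(*(map(int, row[3:]) for row in variables))
    return [
        {"city": city, "address": address, "month": month, **dict(zip(names, values))}
        for month, values in zip(months, columns)
    ]


long_rows = reshape_block(block)
```

Each element of `long_rows` corresponds to one row of the printed result above, for example the month 1 entry holds x = 58568, y = 5345, and z = 53223.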

The sample data set provided by the OP suggests that all data frames within the csv file

  1. have the same structure, i.e., the same number, names, and positions of columns
  2. and the monthly columns V4 to V8 refer to the same months 1 to 5 for all "sub frames".
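
Both assumptions are cheap to verify before reshaping. The following Python sketch is only an illustration: `rows` is a hypothetical, shortened copy of the sample data (one header row and one variable row per city), and the variable names are invented for this example.

```python
# Hypothetical excerpt of the sample data: the "month" header row and one
# variable row for each of the three blocks.
rows = [
    ["nyc", "123 main st", "month", "1", "2", "3", "4", "5"],
    ["nyc", "123 main st", "x", "58568", "567567", "567909", "35876", "56943"],
    ["la", "63 main st", "month", "1", "2", "3", "4", "5"],
    ["la", "63 main st", "a", "87035", "7467456", "3363", "863", "43673"],
    ["sf", "953 main st", "month", "1", "2", "3", "4", "5"],
    ["sf", "953 main st", "x", "457456", "3455", "345345", "56457", "3634"],
]

# Assumption 1: every row has the same number of fields.
widths = {len(row) for row in rows}
same_structure = len(widths) == 1

# Assumption 2: every "month" header row lists the same months.
month_headers = {tuple(row[3:]) for row in rows if row[2] == "month"}
same_months = len(month_headers) == 1
```

If either flag comes out false, the blocks would need to be reshaped one at a time rather than as a single table.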

If this is true, then we can treat the whole csv file as one data frame and convert it to the desired format by reshaping with melt() and dcast() from the data.table package:

library(data.table)
setDT(df3)[, melt(.SD, id.vars = paste0("V", 1:3), na.rm = TRUE)][
  V3 != "month", dcast(.SD, V1 + V2 + rleid(variable) ~ forcats::fct_inorder(V3))][
    , setnames(.SD, 1:3, c("city", "address", "month"))]
    city     address month      x    y      z       a    b       c
 1:   la  63 main st     1     NA   NA     NA   87035  345   86690
 2:   la  63 main st     2     NA   NA     NA 7467456  456 7467000
 3:   la  63 main st     3     NA   NA     NA    3363  345    3018
 4:   la  63 main st     4     NA   NA     NA     863  678     185
 5:   la  63 main st     5     NA   NA     NA   43673  345   43328
 6:  nyc 123 main st     1  58568 5345  53223      NA   NA      NA
 7:  nyc 123 main st     2 567567 3673 563894      NA   NA      NA
 8:  nyc 123 main st     3 567909 3453 564456      NA   NA      NA
 9:  nyc 123 main st     4  35876 3467  32409      NA   NA      NA
10:  nyc 123 main st     5  56943  788  56155      NA   NA      NA
11:   sf 953 main st     1 457456   NA 452111      NA 5345      NA
12:   sf 953 main st     2   3455   NA   -218      NA 3673      NA
13:   sf 953 main st     3 345345   NA 341892      NA 3453      NA
14:   sf 953 main st     4  56457   NA  52990      NA 3467      NA
15:   sf 953 main st     5   3634   NA   2846      NA  788      NA

The fct_inorder() function from Hadley's forcats package is used here to order the columns by their first appearance instead of in alphabetical order a, b, c, x, y, z.

Note that the cities have also been ordered alphabetically. If this is crucial (but I doubt it is), the original order can be preserved as well by using

forcats::fct_inorder(V1) + V2 + rleid(variable) ~ forcats::fct_inorder(V3)

as the dcast() formula.
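
Since the OP mentioned practicing Python, here is a rough standard-library sketch of the same melt-then-cast idea applied to the whole file at once. It is not a port of the data.table code. It assumes the file is comma-separated with empty rows between blocks; `RAW` and `wide_to_long` are hypothetical names, and missing cells come out as `None` rather than `NA`. Because Python dicts keep insertion order, both the cities and the variable columns stay in order of first appearance.

```python
import csv
import io

# Hypothetical CSV text mirroring df3; blank lines separate the blocks.
RAW = """\
nyc,123 main st,month,1,2,3,4,5
nyc,123 main st,x,58568,567567,567909,35876,56943
nyc,123 main st,y,5345,3673,3453,3467,788
nyc,123 main st,z,53223,563894,564456,32409,56155

la,63 main st,month,1,2,3,4,5
la,63 main st,a,87035,7467456,3363,863,43673
la,63 main st,b,345,456,345,678,345
la,63 main st,c,86690,7467000,3018,185,43328

sf,953 main st,month,1,2,3,4,5
sf,953 main st,x,457456,3455,345345,56457,3634
sf,953 main st,b,5345,3673,3453,3467,788
sf,953 main st,z,452111,-218,341892,52990,2846
"""


def wide_to_long(text):
    """Return one dict per (city, address, month) with a key for every variable."""
    records = {}    # (city, address, month) -> {variable: value}
    variables = []  # variable names in order of first appearance
    months = []     # months announced by the current block's header row
    for row in csv.reader(io.StringIO(text)):
        if not any(field.strip() for field in row):
            continue  # blank separator row between blocks
        city, address, name, *values = row
        if name == "month":
            months = [int(v) for v in values]  # header row of a new block
            continue
        if name not in variables:
            variables.append(name)
        # "Melt": one (key, variable, value) cell per month.
        for month, value in zip(months, values):
            records.setdefault((city, address, month), {})[name] = int(value)
    # "Cast": spread the variables into columns, filling gaps with None.
    return [
        {"city": c, "address": a, "month": m, **{v: cells.get(v) for v in variables}}
        for (c, a, m), cells in records.items()
    ]


rows = wide_to_long(RAW)
```

The resulting list can be written back out with `csv.DictWriter`, using the keys of the first dict as field names.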

Data

Unfortunately, the OP didn't supply the result of dput(df3), which made it unnecessarily difficult to reproduce the data set as printed in the question:

df3 <- readr::read_table(
  "     V1          V2    V3     V4      V5     V6      V7    V8
  1   nyc 123 main st month      1       2      3       4     5
  2   nyc 123 main st     x  58568  567567 567909   35876 56943
  3   nyc 123 main st     y   5345    3673   3453    3467   788
  4   nyc 123 main st     z  53223  563894 564456   32409 56155
  5                                                            
  6    la  63 main st month      1       2      3       4     5
  7    la  63 main st     a  87035 7467456   3363     863 43673
  8    la  63 main st     b    345     456    345     678   345
  9    la  63 main st     c  86690 7467000   3018     185 43328
  10                                                           
  11   sf 953 main st month      1       2      3       4     5
  12   sf 953 main st     x 457456    3455 345345   56457  3634
  13   sf 953 main st     b   5345    3673   3453    3467   788
  14   sf 953 main st     z 452111    -218 341892   52990  2846"
)
library(data.table)
setDT(df3)[, V2 := paste(X3, V2)][, c("X1", "X3") := NULL]
setDF(df3)[]
    V1          V2    V3     V4      V5     V6    V7    V8
1  nyc 123 main st month      1       2      3     4     5
2  nyc 123 main st     x  58568  567567 567909 35876 56943
3  nyc 123 main st     y   5345    3673   3453  3467   788
4  nyc 123 main st     z  53223  563894 564456 32409 56155
5   NA NA NA NA NA NA
6   la  63 main st month      1       2      3     4     5
7   la  63 main st     a  87035 7467456   3363   863 43673
8   la  63 main st     b    345     456    345   678   345
9   la  63 main st     c  86690 7467000   3018   185 43328
10  NA NA NA NA NA NA
11  sf 953 main st month      1       2      3     4     5
12  sf 953 main st     x 457456    3455 345345 56457  3634
13  sf 953 main st     b   5345    3673   3453  3467   788
14  sf 953 main st     z 452111    -218 341892 52990  2846
