
Converting data from wide to long format using reshape

I have a CSV file in wide format that I need to convert to long format. I have shown only the first 3 rows.

CODEA   C45 ragek   ra80    ra98    ... Obese14 Overweight14 Obese21 hibp14 hibp21 Overweight21
1   1   NA  3   4   1   NA  NA  NA  NA  NA  NA      NA  NA
2   3   2   3   3   1   0   0   0   0   1   0   0   0
3   4   2   3   6   1   NA  NA  NA  NA  NA  NA  NA  NA

This goes on. Obese14 (yes/no), Overweight14 (yes/no), etc.

> names(Copy.of.BP_2)

 [1] "CODEA"  "C45"                     "ragek"                   "ra80"              
 [5] "ra98"   "CBCLAggressionAt1410"    "CBCLInternalisingAt1410" "Obese14"              
 [9] "Overweight14"   "Overweight21"    "Obese21"                 "hibp14"               
[13] "hibp21"          

It has 6898 observations and 13 variables.

I am trying to organise this data in a stacked format, and I thought the layout below would be a good option. I am not sure how to combine the obese and overweight categories, because the original data has Obese14, Overweight14, Obese21 and Overweight21 as 4 separate columns.

CODEA ...  time         Obese        Overweight      HiBP

           14 
           21
           14
           21 ... etc

I gave the syntax as:

BP.stack1=reshape(Copy.of.BP_2, 
   timevar="time",direction="long",
   varying=list(names(Copy.of.BP_2[8:13]),
   v.names="Obese","Overweight","HiBP",idvar=c("CODEA")

It does not seem to work: R prints a + prompt and waits for further input.

Should I be using melt and cast? I read the reshape package manual, but cannot understand it.

Edit: question restructured

Sticking with base R's reshape(), try the following.

I think I have recreated your example data with the following:

Copy.of.BP_2 <- 
structure(list(CODEA = c(1, 3, 4), C45 = c(NA, 2, 2), ragek = c(3, 
3, 3), ra80 = c(4, 3, 6), ra98 = c(1, 1, 1), CBCLAggressionAt1410 = c(NA, 
0, NA), CBCLInternalisingAt1410 = c(NA, 0, NA), Obese14 = c(NA, 
0, NA), Overweight14 = c(NA, 0, NA), Overweight21 = c(NA, 1, 
NA), Obese21 = c(NA, 0, NA), hibp14 = c(NA, 0, NA), hibp21 = c(NA, 
0, NA)), .Names = c("CODEA", "C45", "ragek", "ra80", "ra98", 
"CBCLAggressionAt1410", "CBCLInternalisingAt1410", "Obese14", 
"Overweight14", "Overweight21", "Obese21", "hibp14", "hibp21"
), row.names = c(NA, -3L), class = "data.frame")

Copy.of.BP_2
#   CODEA C45 ragek ra80 ra98 CBCLAggressionAt1410 CBCLInternalisingAt1410
# 1     1  NA     3    4    1                   NA                      NA
# 2     3   2     3    3    1                    0                       0
# 3     4   2     3    6    1                   NA                      NA
#   Obese14 Overweight14 Overweight21 Obese21 hibp14 hibp21
# 1      NA           NA           NA      NA     NA     NA
# 2       0            0            1       0      0      0
# 3      NA           NA           NA      NA     NA     NA

First, for convenience, let's create a vector of the measure variables, that is, the variables that we want to "stack" from wide to long format.

measurevars <- names(Copy.of.BP_2)[grepl("Obese|Overweight|hibp", 
                                         names(Copy.of.BP_2))]

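Next, use reshape(), specifying the direction, the identification variable, and which variables vary with time (measurevars, from above). Setting sep = "" tells reshape() that there is no separator between the variable name and the time, so a column such as Obese14 is split into the variable Obese and the time 14.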

BP_2_long <- reshape(Copy.of.BP_2, direction = "long", idvar="CODEA",
                     varying = measurevars, sep = "")
BP_2_long
#      CODEA C45 ragek ra80 ra98 CBCLAggressionAt1410 CBCLInternalisingAt1410
# 1.14     1  NA     3    4    1                   NA                      NA
# 3.14     3   2     3    3    1                    0                       0
# 4.14     4   2     3    6    1                   NA                      NA
# 1.21     1  NA     3    4    1                   NA                      NA
# 3.21     3   2     3    3    1                    0                       0
# 4.21     4   2     3    6    1                   NA                      NA
#      time Obese Overweight hibp
# 1.14   14    NA         NA   NA
# 3.14   14     0          0    0
# 4.14   14    NA         NA   NA
# 1.21   21    NA         NA   NA
# 3.21   21     0          1    0
# 4.21   21    NA         NA   NA
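To see why sep = "" is enough here, it may help to look at how the column names decompose into a stem and a trailing number. The snippet below is only an illustration of that idea using sub(); it is not reshape()'s internal code, and the names stems, times and split_names are made up for this example.

```r
# Column names as they appear in the example data.
measurevars <- c("Obese14", "Overweight14", "Overweight21",
                 "Obese21", "hibp14", "hibp21")

# Split each name into a leading stem and a trailing run of digits.
stems <- sub("[0-9]+$", "", measurevars)                  # drop the trailing digits
times <- as.numeric(sub("^[^0-9]+", "", measurevars))     # drop the leading non-digits

# One row per wide column: which long variable and which time it feeds.
split_names <- data.frame(column = measurevars, stem = stems, time = times)
print(split_names)
```

Each distinct stem becomes one column in the long data, and each distinct number becomes one value of time, regardless of the order in which the wide columns appear.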

If you are only interested in the id column and the measure columns, you can also add a drop argument to your reshape() command:

BP_2_long_2 <- reshape(
  Copy.of.BP_2, direction = "long", idvar="CODEA",
  varying = measurevars, sep = "",
  drop = !names(Copy.of.BP_2) %in% c(measurevars, "CODEA"))
BP_2_long_2
#      CODEA time Obese Overweight hibp
# 1.14     1   14    NA         NA   NA
# 3.14     3   14     0          0    0
# 4.14     4   14    NA         NA   NA
# 1.21     1   21    NA         NA   NA
# 3.21     3   21     0          1    0
# 4.21     4   21    NA         NA   NA

Update: Why your code doesn't work

Here is an argument-by-argument breakdown of what you tried, with comments on how to fix it.

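BP.stack1 = 
reshape(Copy.of.BP_2,                    # Fine
timevar="time",                          # Fine
direction="long",                        # Fine
varying=list(names(Copy.of.BP_2)[8:13]), # Wrong. Give one vector of column names per variable
v.names="Obese","Overweight","HiBP",     # Wrong. This needs to be in c()
idvar=c("CODEA")                         # Almost... missing your closing ")"

Thus, to get a complete working command, group the columns by variable and supply the times explicitly. (Passing varying=8:13 together with v.names would also run, but it assumes the columns are ordered Obese14, Overweight14, hibp14, Obese21, and so on, which is not the case here, and it would number the times 1 and 2 instead of 14 and 21.)

BP.stack1 = reshape(
  Copy.of.BP_2, 
  timevar="time", 
  direction="long", 
  varying=list(c("Obese14", "Obese21"),
               c("Overweight14", "Overweight21"),
               c("hibp14", "hibp21")), 
  v.names=c("Obese", "Overweight", "HiBP"),
  times=c(14, 21),
  idvar=c("CODEA"))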

I generally try not to depend too much on column numbers, since columns are more likely to be rearranged than renamed. Hence my use of grepl() to match names according to a certain pattern.
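The same preference for names over positions can be carried over to the explicit varying/v.names form. What follows is a minimal sketch, assuming the measure columns are always named as a stem followed by the age; the helper names stems, ages, varying_by_name and BP_long_named are hypothetical, and the data frame is a cut-down stand-in for the example data.

```r
# Small stand-in for the example data (same measure columns, same order).
Copy.of.BP_2 <- data.frame(
  CODEA = c(1, 3, 4),
  Obese14 = c(NA, 0, NA), Overweight14 = c(NA, 0, NA),
  Overweight21 = c(NA, 1, NA), Obese21 = c(NA, 0, NA),
  hibp14 = c(NA, 0, NA), hibp21 = c(NA, 0, NA))

stems <- c("Obese", "Overweight", "hibp")  # prefixes used in the column names
ages  <- c(14, 21)                         # suffixes used in the column names

# One character vector per measured variable, ordered by age,
# e.g. c("Obese14", "Obese21").
varying_by_name <- lapply(stems, function(s) paste0(s, ages))

# Fail early if any expected column is missing from the data.
stopifnot(all(unlist(varying_by_name) %in% names(Copy.of.BP_2)))

BP_long_named <- reshape(
  Copy.of.BP_2, direction = "long", idvar = "CODEA", timevar = "time",
  varying = varying_by_name,
  v.names = c("Obese", "Overweight", "HiBP"),  # same order as stems
  times = ages)
BP_long_named
```

Because the list is built from names, the result should not change if the wide columns are later reordered; the stopifnot() check is there so that a renamed column produces an error rather than a silently wrong reshape.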
