
Convert a long dataset to a wide dataset using R

I'd appreciate some help with what R code to use in the following situation.

These are the first 11 rows of the dataset:

Sa1_main11  Sa1_main11_2
20401106101 20401106101
20401106101 21105128609
20401106101 21105128653
20601110501 20601110501
20601110501 20601110530
20601110501 20601110531
20601110501 20601110532
20601110501 20601110533
20601110501 20601110534
20601110501 20601110614
20601110502 20601110502

SA1s are a geographical unit used by the Australian Bureau of Statistics.

This file lists which SA1s are contiguous: column 1 is the base SA1, and column 2 is an SA1 that adjoins it.

For example, take the first 3 rows:

  • 20401106101 adjoins itself
  • 21105128609 adjoins 20401106101
  • 21105128653 adjoins 20401106101

What I need to produce is a dataset whose first line has the format

20401106101  21105128609  21105128653

I've tried the reshape2 package, but the lack of row labels (which would all be identical) means I can't get it to work.

Edit: here is a link to what the data looks like:

https://www.dropbox.com/s/tigqdevybskm1bs/Original.JPG

Here is a link to what the top 3 rows should look like:

https://www.dropbox.com/s/b2l36mry9ibfnfq/Destination.JPG

It looks like split might help you:

split(DF[,2], DF[,1]) 

#$`20401106101`
#[1] 20401106101 21105128609 21105128653
#
#$`20601110501`
#[1] 20601110501 20601110530 20601110531 20601110532 20601110533 20601110534 20601110614
#
#$`20601110502`
#[1] 20601110502
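To try this on the sample rows from the question, here is a minimal sketch. It assumes the data sit in a two-column data frame named DF, as above. Storing the SA1 codes as character strings is an assumption on my part (they are identifiers rather than quantities), so the values print with quotes.

```r
# Sample rows from the question; codes kept as character (an assumption)
DF <- data.frame(
  Sa1_main11   = c(rep("20401106101", 3), rep("20601110501", 7), "20601110502"),
  Sa1_main11_2 = c("20401106101", "21105128609", "21105128653",
                   "20601110501", "20601110530", "20601110531", "20601110532",
                   "20601110533", "20601110534", "20601110614",
                   "20601110502"),
  stringsAsFactors = FALSE
)

# One list element per base SA1, holding the SA1s that adjoin it
res <- split(DF[, 2], DF[, 1])

# Look up the neighbours of a single base SA1 by name
res[["20401106101"]]

# Number of entries (the SA1 itself included) per base SA1
lengths(res)
```

Because the result is a list, each base SA1 keeps exactly as many entries as it has rows in the input.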

It's unclear what you intend to do with the data. Neither data frames nor matrices can hold rows of different lengths, so replicating the exact result is a bit complicated (and not very useful). Padding the shorter rows with NA comes close:

res <- split(DF[,2], DF[,1])
# The longest group determines the number of columns
n <- max(lengths(res))
# Pad every group with NA up to that length
res <- lapply(res, function(x) {
  length(x) <- n
  x
})

do.call(rbind, res)
#                   [,1]        [,2]        [,3]        [,4]        [,5]        [,6]        [,7]
#20401106101 20401106101 21105128609 21105128653          NA          NA          NA          NA
#20601110501 20601110501 20601110530 20601110531 20601110532 20601110533 20601110534 20601110614
#20601110502 20601110502          NA          NA          NA          NA          NA          NA
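If the goal is simply a text file with one ragged line per base SA1, as in the question's example, the list returned by split can be written out directly without any NA padding. A minimal sketch, assuming a DF laid out as in the question; the four sample rows and the output file name are placeholders of mine:

```r
# Stand-in data: the first three rows and the last row from the question
DF <- data.frame(
  Sa1_main11   = c("20401106101", "20401106101", "20401106101", "20601110502"),
  Sa1_main11_2 = c("20401106101", "21105128609", "21105128653", "20601110502"),
  stringsAsFactors = FALSE
)

res <- split(DF[, 2], DF[, 1])

# Collapse each group into a single space-separated line
out <- vapply(res, paste, character(1), collapse = " ")

# Hypothetical output location; change it to suit
out_file <- file.path(tempdir(), "sa1_wide.txt")
writeLines(out, out_file)

readLines(out_file)[1]
# [1] "20401106101 21105128609 21105128653"
```

Each line then has only as many fields as that SA1 has entries, which a rectangular data frame or matrix cannot represent.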

Check whether this works (dat is the dataset):

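 library(reshape2)
 # Number the rows within each base SA1 (1, 2, 3, ...) to serve as column labels
 dat$indx <- with(dat, ave(seq_along(Sa1_main11), Sa1_main11, FUN=seq_along))
 # Spread the adjoining SA1s into one column per index; missing cells become NA
 dcast(dat, Sa1_main11~indx, value.var="Sa1_main11_2")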
 #     Sa1_main11           1           2           3           4           5
 #1 20401106101 20401106101 21105128609 21105128653          NA          NA
 #2 20601110501 20601110501 20601110530 20601110531 20601110532 20601110533
 #3 20601110502 20601110502          NA          NA          NA          NA
 #           6           7
 #1          NA          NA
 #2 20601110534 20601110614
 #3          NA          NA
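The same within-group index should also work with base R's reshape(), in case reshape2 is not installed. This is a sketch rather than a drop-in replacement: dat here is a small stand-in for the real data, and the resulting column names (Sa1_main11_2.1 and so on) come from reshape()'s default naming, not from dcast.

```r
# Stand-in data: the first three rows and the last row from the question
dat <- data.frame(
  Sa1_main11   = c(20401106101, 20401106101, 20401106101, 20601110502),
  Sa1_main11_2 = c(20401106101, 21105128609, 21105128653, 20601110502)
)

# Running index within each base SA1
dat$indx <- ave(seq_along(dat$Sa1_main11), dat$Sa1_main11, FUN = seq_along)

# One row per base SA1, one column per index value; short rows are filled with NA
wide <- reshape(dat, idvar = "Sa1_main11", timevar = "indx", direction = "wide")
wide
```

As with dcast, the number of columns is set by the base SA1 with the most entries.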
