Problems reshaping a data frame using dplyr, tidyr, data.table and reshape

I have a very large data set, which I am sharing via a link because I don't know any other way to show it to you. I need the file to look like this. The second link is only an example of the full result, because it would take far too long to build it by hand.

It has been suggested that I try this. However, my example in that post apparently wasn't sufficient, because with none of the proposals am I getting the result that I need. I've been trying for a week and I really don't know how to solve it, so I have decided to post my real data via a link in case that is more helpful. First, using dplyr and tidyr:

d<-read.csv("m.tot3.csv",header=TRUE, sep=",",dec=".")
df<-data.frame(d)

library(dplyr)
library(tidyr)
library(data.table)

sub1 <- df[c(TRUE, FALSE),]
sub2 <- df[c(FALSE, TRUE),]

tibble(ind = c(row(sub1)), col1 = factor(unlist(sub1), levels = letters[1:1688]), 
       col2 = as.integer(unlist(sub2))) %>% 
  pivot_wider(names_from = col1, values_from = col2,
              values_fill = list(col2 = 0)) %>%
  select(-ind)

I get this error message

Error: Can't convert <double> to <list>.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning message:
Values in `col2` are not uniquely identified; output will contain list-cols.
 Use `values_fn = list(col2 = list)` to suppress this warning.
 Use `values_fn = list(col2 = length)` to identify where the duplicates arise
 Use `values_fn = list(col2 = summary_fun)` to summarise duplicates

Using reshape

sub1 <- df[c(TRUE, FALSE),]
sub2 <- df[c(FALSE, TRUE),]

out <- reshape(
  data.frame(ind = c(row(sub1)), 
             col1 = factor(unlist(sub1), levels = letters[1:1688]), 
             col2 = as.integer(unlist(sub2))),
  idvar = 'ind', direction = 'wide', timevar = 'col1')[-1]

names(out) <- sub("col2\\.", "", names(out))
out[is.na(out)] <- 0
row.names(out) <- NULL

I get this warning message

Warning messages:
1: In reshapeWide(data, idvar = idvar, timevar = timevar, varying = varying,  :
  there are records with missing times, which will be dropped.
2: In reshapeWide(data, idvar = idvar, timevar = timevar, varying = varying,  :
  multiple rows match for col1=NA: first taken

Finally, using data.table

 d_test<-melt(
   setDT(
    setnames(
      data.table::transpose(df), 
      paste(rep(1:(nrow(d)/2), each = 2), c("name", "value"), sep = "_"))),
  measure = patterns("name", "value"))[
    , dcast(.SD, variable ~ value1, value.var = "value2", fill = 0)]

I get this

I really don't know how to solve it, and any answer is very welcome. Regards.

One issue is that the factor conversion with levels = letters[1:1688] returns all NA, because those levels do not match the unique values in the data set. Take the levels from the data instead:

library(dplyr)
library(tidyr)
library(data.table)
# sub1 and sub2 are the odd and even rows of df, as defined in the question
df1 <- tibble(ind = c(row(sub1)), 
              col1 = factor(unlist(sub1), levels = unique(unlist(sub1))), 
              col2 = as.integer(unlist(sub2)))
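
To see why the original levels fail, here is a minimal sketch in base R. The values in x are made up and only stand in for unlist(sub1); the names x, bad and good are hypothetical and not part of the original data.

```r
# Toy values standing in for unlist(sub1); these numbers are invented.
x <- c(69, 70, 71, 82, 69)

# Levels that never occur in the data: every element is converted to NA.
bad <- factor(x, levels = letters[1:5])
print(bad)

# Levels taken from the data itself: every value is kept.
good <- factor(x, levels = unique(x))
print(good)
```

Note also that letters has only 26 elements, so letters[1:1688] is mostly NA to begin with.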

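The second issue is that there are duplicates, so the rows are not uniquely identified. We create a sequence column within each 'col1' value (rowid() comes from data.table) before pivoting: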

out <- df1 %>% 
    mutate(rn = rowid(col1)) %>%
    pivot_wider(names_from = col1, values_from = col2,
           values_fill = list(col2 = 0)) %>%
    select(-rn)

dim(out)
#[1]   23 3704


out[1:5, 1:5]
# A tibble: 5 x 5
#    ind  `69`  `70`  `71`  `82`
#  <int> <int> <int> <int> <int>
#1     1     2     0     0     0
#2     2     0     4     0     0
#3     3     0     0     6     0
#4     4     0     0     0     8
#5     5     0     0     0     0
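
The effect of the sequence column can be checked on a small example. The sketch below uses an invented four-row tibble (toy) rather than the real data, and assumes dplyr, tidyr and data.table are installed; rowid() is called with its namespace so that data.table does not need to be attached.

```r
library(dplyr)
library(tidyr)

# Invented data: the value "69" appears twice, which would otherwise
# make pivot_wider() produce list-columns.
toy <- tibble(
  col1 = factor(c("69", "70", "69", "71"), levels = c("69", "70", "71")),
  col2 = c(2L, 4L, 5L, 6L)
)

wide <- toy %>%
  # rn counts occurrences within each col1 value: 1, 1, 2, 1
  mutate(rn = data.table::rowid(col1)) %>%
  # rn is the only remaining id column, so each (rn, col1) pair is unique
  pivot_wider(names_from = col1, values_from = col2,
              values_fill = list(col2 = 0L))

print(wide)
```

In this toy case the result has two rows: the second occurrence of 69 lands on the row with rn equal to 2, and the missing cells on that row are filled with 0.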
