简体   繁体   中英

Tidying same ordered data placed in different rows across different columns

I have made a bulk import that results in df that ends up in a non-tidy fashion. The problem lies on the original databases, and on that end there is nothing I can do.

After import and rotate_df I end up with this

x   y   z
1   NA  NA
2   NA  NA
NA  1   NA   
NA  2   NA
NA  NA  1
NA  NA  2

Basically, instead a new file's data is being added stackedly, the data of each additional column (after rotate) only starts at +34 rows than the previous one.

Given I can not change the original databases using, is there any jedi trick that allows changing my current df to this?

x   y   z
1   1   1
2   2   2

As each collumn has the same information, it would allow for my df to be tidy.

Thank you.

You could try

as.data.frame(lapply(df, na.omit))

#   x y z
# 1 1 1 1
# 2 2 2 2

Update: It seems that all variables do not have the same length after NA s are removed. Take the following data as example:

x   y   z
1   NA  NA
2   NA  NA
NA  1   NA   
NA  2   NA
NA  NA  1
NA  NA  2
NA  NA  3

What's the expected output for it? Keep the 3 like

#  x  y  z
#  1  1  1
#  2  2  2
# NA AN  3

or discard it

#  x  y  z
#  1  1  1
#  2  2  2

or use other rules?

Another possible solution, based on purrr::map_dfc :

library(purrr)

map_dfc(df, na.omit)

#> # A tibble: 2 × 3
#>       x     y     z
#>   <int> <int> <int>
#> 1     1     1     1
#> 2     2     2     2

If you're getting vector mismatch errors, it suggests that some of your "correct" values are also NAs. These could be NULLs in your database, or values that get turned into NA by R because of a type mismatch: make sure you know what it happening and that you're not losing data in the import.

That said, this means you'll probably want to use the other piece of information you have - that the columns are staggered by a reproducible distance.

df <-
structure(list(x = c(1L, 2L, NA, NA, NA, NA), y = c(NA, NA, 1L, 
2L, NA, NA), z = c(NA, NA, NA, NA, 1L, 2L)), row.names = c(1L, 
2L, 3L, 4L, 5L, 6L), class = "data.frame")

stagger <- 2 # this would be 34 in your real data

df_compact <- as.data.frame(lapply(1:ncol(df), function(i) 
    df[((i-1)*stagger+1) : (i*stagger), i]
))
colnames(df_compact) <- colnames(df)

 df_compact
  x y z
1 1 1 1
2 2 2 2

This will only work correctly if nrow(df) = stagger * ncol(df) . If not, you will not get an error but all sorts of horrid imputed NA s, so be very careful and find some other way to validate your data afterwards.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM