
Split a data frame column into multiple columns (of different lengths) using a delimiter

I have this table:

cca2    ccn3    cca3    borders
AX      248     ALA 
AL      8       ALB     MNE,GRC,MKD,UNK
AD      20      AND     FRA,ESP
AT      40      AUT     CZE,DEU,HUN,ITA,LIE,SVK,SVN,CHE
BE      56      BEL     FRA,DEU,LUX,NLD

and would like to separate borders into multiple columns. As you can see, the rows do not all have the same number of borders.

I tried:

newCountries <- data.frame(do.call('rbind', strsplit(as.character(countries$borders),',',fixed=TRUE)))

but it didn't work well. How can I solve this?

I would like the result to look like this:

cca2    ccn3    cca3    b1   b2   b3  b4  b5  b6  b7  b8
AX      248     ALA     NA   NA   NA  NA  NA  NA  NA  NA
AL      8       ALB     MNE  GRC  MKD UNK NA  NA  NA  NA
AD      20      AND     FRA  ESP  NA  NA  NA  NA  NA  NA
AT      40      AUT     CZE  DEU  HUN ITA LIE SVK SVN CHE
BE      56      BEL     FRA  DEU  LUX NLD NA  NA  NA  NA

Here are two ways.

The first is mostly base R, but borrows separate from tidyr (which ships with tidyverse). I used sapply to split the string in each value of borders, then took the maximum of the resulting lengths; in this case, that's 8 borders. I then used that number to build the column names for separate. I think separate is a handy function, but it can be tricky if you don't know in advance exactly how many columns you'll need.

The second way is dplyr-based: I split the strings in borders, unnested the result into a long data frame, created column names from the position of each entry within its cca2 group, and used spread to get back to a wide format.

library(tidyverse)


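# Largest number of comma-separated entries in any row of borders
max_borders <- max(sapply(df$borders, function(x) length(strsplit(x, ",")[[1]]), simplify = TRUE))
# Split borders into b1..b8; rows with fewer entries are padded with NA (hence the warning)
tidyr::separate(df, borders, into = paste0("b", 1:max_borders), sep = ",")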
#> Warning: Expected 8 pieces. Missing pieces filled with `NA` in 3 rows [2,
#> 3, 5].
#> # A tibble: 5 x 11
#>   cca2   ccn3 cca3  b1    b2    b3    b4    b5    b6    b7    b8   
#>   <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 AX      248 ALA   <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 2 AL        8 ALB   MNE   GRC   MKD   UNK   <NA>  <NA>  <NA>  <NA> 
#> 3 AD       20 AND   FRA   ESP   <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 4 AT       40 AUT   CZE   DEU   HUN   ITA   LIE   SVK   SVN   CHE  
#> 5 BE       56 BEL   FRA   DEU   LUX   NLD   <NA>  <NA>  <NA>  <NA>


df %>%
    mutate(border_list = str_split(borders, ",")) %>%
    unnest(border_list) %>%
    select(-borders) %>%
    group_by(cca2) %>%
    mutate(col = paste0("b", row_number())) %>%
    spread(key = col, value = border_list)
#> # A tibble: 5 x 11
#> # Groups:   cca2 [5]
#>   cca2   ccn3 cca3  b1    b2    b3    b4    b5    b6    b7    b8   
#>   <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 AD       20 AND   FRA   ESP   <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 2 AL        8 ALB   MNE   GRC   MKD   UNK   <NA>  <NA>  <NA>  <NA> 
#> 3 AT       40 AUT   CZE   DEU   HUN   ITA   LIE   SVK   SVN   CHE  
#> 4 AX      248 ALA   <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 5 BE       56 BEL   FRA   DEU   LUX   NLD   <NA>  <NA>  <NA>  <NA>

Created on 2018-05-08 by the reprex package (v0.2.0).
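
For comparison, here is a minimal base R sketch of the same "find the maximum length first" idea, with no tidyr at all. The attempt in the question goes wrong because rbind recycles the shorter vectors instead of padding them, so the sketch pads each split vector with NA before binding. The names pieces, padded, border_cols and result are illustrative only, and the data frame is rebuilt by hand from the question's table.

```r
# Toy data mirroring the question's table (borders is character, NA when empty).
df <- data.frame(
  cca2 = c("AX", "AL", "AD", "AT", "BE"),
  ccn3 = c(248L, 8L, 20L, 40L, 56L),
  cca3 = c("ALA", "ALB", "AND", "AUT", "BEL"),
  borders = c(NA, "MNE,GRC,MKD,UNK", "FRA,ESP",
              "CZE,DEU,HUN,ITA,LIE,SVK,SVN,CHE", "FRA,DEU,LUX,NLD"),
  stringsAsFactors = FALSE
)

# Split each string; as.character() guards against factor columns.
pieces <- strsplit(as.character(df$borders), ",", fixed = TRUE)
max_borders <- max(lengths(pieces))

# Pad every vector to the same length with NA, then stack them row-wise.
padded <- lapply(pieces, function(x) {
  length(x) <- max_borders
  x
})
border_cols <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(border_cols) <- paste0("b", seq_len(max_borders))

# Re-attach the identifying columns.
result <- cbind(df[c("cca2", "ccn3", "cca3")], border_cols)
result
```

Assigning to length(x) is what does the padding: extending a vector this way fills the new positions with NA, so every row ends up with max_borders entries before rbind sees it.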

Another option is cSplit from the splitstackshape package.

library(splitstackshape)
df <- cSplit(indt = df, splitCols = "borders", sep = ",", direction = "wide")
names(df) <- c(names(df)[1:3], paste0("b", 1:8)) #optional
df
#   cca2 ccn3 cca3   b1   b2   b3   b4   b5   b6   b7   b8
#1:   AX  248  ALA <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#2:   AL    8  ALB  MNE  GRC  MKD  UNK <NA> <NA> <NA> <NA>
#3:   AD   20  AND  FRA  ESP <NA> <NA> <NA> <NA> <NA> <NA>
#4:   AT   40  AUT  CZE  DEU  HUN  ITA  LIE  SVK  SVN  CHE
#5:   BE   56  BEL  FRA  DEU  LUX  NLD <NA> <NA> <NA> <NA>

data

df <- structure(list(cca2 = structure(c(4L, 2L, 1L, 3L, 5L), .Label = c("AD", 
"AL", "AT", "AX", "BE"), class = "factor"), ccn3 = c(248L, 8L, 
20L, 40L, 56L), cca3 = structure(1:5, .Label = c("ALA", "ALB", 
"AND", "AUT", "BEL"), class = "factor"), borders = structure(c(NA, 
4L, 3L, 1L, 2L), .Label = c("CZE,DEU,HUN,ITA,LIE,SVK,SVN,CHE", 
"FRA,DEU,LUX,NLD", "FRA,ESP", "MNE,GRC,MKD,UNK"), class = "factor")), .Names = c("cca2", 
"ccn3", "cca3", "borders"), class = "data.frame", row.names = c(NA, 
-5L))
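
Note that this dput output stores cca2, cca3 and borders as factors. cSplit copes with that, as the output above shows, but base strsplit() stops with a "non-character argument" error on factor input. If you want to reuse this data with strsplit-based code, one option is to convert the factor columns first. A minimal sketch on a small stand-in data frame (the two-row df below is made up for illustration, not the full data):

```r
# Small stand-in with factor columns, similar in shape to the dput data.
df <- data.frame(
  cca2 = factor(c("AX", "AL")),
  ccn3 = c(248L, 8L),
  borders = factor(c(NA, "MNE,GRC,MKD,UNK"))
)

# Find the factor columns and convert only those to character.
is_fct <- vapply(df, is.factor, logical(1))
df[is_fct] <- lapply(df[is_fct], as.character)

str(df)
```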

Here is one more way, similar to camille's, but using separate_rows from tidyr, which works like unnest but for delimited strings such as these. This avoids the separate str_split and unnest steps. We can then create column names and spread in much the same way.

library(tidyverse)
df <- read_table2(
  "cca2    ccn3    cca3    borders
  AX      248     ALA 
  AL      8       ALB     MNE,GRC,MKD,UNK
  AD      20      AND     FRA,ESP
  AT      40      AUT     CZE,DEU,HUN,ITA,LIE,SVK,SVN,CHE
  BE      56      BEL     FRA,DEU,LUX,NLD"
)

df %>%
  separate_rows(borders, sep = ",") %>%
  group_by(cca2) %>%
  mutate(b = row_number()) %>%
  spread(b, borders, sep = "")
#> # A tibble: 5 x 11
#> # Groups:   cca2 [5]
#>   cca2   ccn3 cca3  b1    b2    b3    b4    b5    b6    b7    b8   
#>   <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 AD       20 AND   FRA   ESP   <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 2 AL        8 ALB   MNE   GRC   MKD   UNK   <NA>  <NA>  <NA>  <NA> 
#> 3 AT       40 AUT   CZE   DEU   HUN   ITA   LIE   SVK   SVN   CHE  
#> 4 AX      248 ALA   <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 5 BE       56 BEL   FRA   DEU   LUX   NLD   <NA>  <NA>  <NA>  <NA>

Created on 2018-05-08 by the reprex package (v0.2.0).
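
In more recent tidyr releases, spread has been superseded by pivot_wider. The following is a sketch of the same pipeline with that function, assuming tidyr 1.0.0 or later is installed. The tibble is typed in by hand from the question's table, and the name wide is illustrative.

```r
library(dplyr)
library(tidyr)

# Data from the question, with an explicit NA for the row without borders.
df <- tibble(
  cca2 = c("AX", "AL", "AD", "AT", "BE"),
  ccn3 = c(248L, 8L, 20L, 40L, 56L),
  cca3 = c("ALA", "ALB", "AND", "AUT", "BEL"),
  borders = c(NA, "MNE,GRC,MKD,UNK", "FRA,ESP",
              "CZE,DEU,HUN,ITA,LIE,SVK,SVN,CHE", "FRA,DEU,LUX,NLD")
)

wide <- df %>%
  separate_rows(borders, sep = ",") %>%   # one row per border
  group_by(cca2) %>%
  mutate(b = row_number()) %>%            # position of each border within its country
  ungroup() %>%
  pivot_wider(names_from = b, values_from = borders, names_prefix = "b")

wide
```

Here names_prefix = "b" plays the role of sep = "" in the spread call, turning the positions 1 to 8 into the column names b1 to b8. The ungroup() call simply drops the grouping before reshaping, so the result is a plain tibble.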
