I have this table:
cca2 ccn3 cca3 borders
AX   248  ALA
AL   8    ALB  MNE,GRC,MKD,UNK
AD   20   AND  FRA,ESP
AT   40   AUT  CZE,DEU,HUN,ITA,LIE,SVK,SVN,CHE
BE   56   BEL  FRA,DEU,LUX,NLD
and would like to separate borders into multiple columns. As you can see, the rows do not all have the same number of borders.
I tried:
newCountries <- data.frame(do.call('rbind', strsplit(as.character(countries$borders),',',fixed=TRUE)))
but it didn't work well. How can I solve this?
I would like the result to look like this:
cca2 ccn3 cca3 b1  b2  b3  b4  b5  b6  b7  b8
AX   248  ALA  NA  NA  NA  NA  NA  NA  NA  NA
AL   8    ALB  MNE GRC MKD UNK NA  NA  NA  NA
AD   20   AND  FRA ESP NA  NA  NA  NA  NA  NA
AT   40   AUT  CZE DEU HUN ITA LIE SVK SVN CHE
BE   56   BEL  FRA DEU LUX NLD NA  NA  NA  NA
Here are two ways.
The first is mostly base R, but borrows separate() from tidyr (ships with tidyverse). For this, I used sapply() to split the strings in each value of borders, then took the maximum length of those. In this case, that's 8 borders. Then I used this to determine the column names for separate(). I think separate() is a handy function, but it's sometimes tricky if you don't know exactly how many columns you'll need.
The second way is dplyr-based, where I split the strings in borders, unnested it into a long data frame, created column numbers based on how many entries there were for each value of cca2, and used spread() to get it back into a wide format.
library(tidyverse)
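# Largest number of comma-separated entries in any row (8 here);
# as.character() keeps this working if borders is stored as a factor
max_borders <- max(sapply(as.character(df$borders), function(x) length(strsplit(x, ",")[[1]])))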
tidyr::separate(df, borders, into = paste0("b", 1:max_borders), sep = ",")
#> Warning: Expected 8 pieces. Missing pieces filled with `NA` in 3 rows [2,
#> 3, 5].
#> # A tibble: 5 x 11
#> cca2 ccn3 cca3 b1 b2 b3 b4 b5 b6 b7 b8
#> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 AX 248 ALA <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 2 AL 8 ALB MNE GRC MKD UNK <NA> <NA> <NA> <NA>
#> 3 AD 20 AND FRA ESP <NA> <NA> <NA> <NA> <NA> <NA>
#> 4 AT 40 AUT CZE DEU HUN ITA LIE SVK SVN CHE
#> 5 BE 56 BEL FRA DEU LUX NLD <NA> <NA> <NA> <NA>
df %>%
  mutate(border_list = str_split(borders, ",")) %>%
  unnest(border_list) %>%
  select(-borders) %>%
  group_by(cca2) %>%
  mutate(col = paste0("b", row_number())) %>%
  spread(key = col, value = border_list)
#> # A tibble: 5 x 11
#> # Groups: cca2 [5]
#> cca2 ccn3 cca3 b1 b2 b3 b4 b5 b6 b7 b8
#> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 AD 20 AND FRA ESP <NA> <NA> <NA> <NA> <NA> <NA>
#> 2 AL 8 ALB MNE GRC MKD UNK <NA> <NA> <NA> <NA>
#> 3 AT 40 AUT CZE DEU HUN ITA LIE SVK SVN CHE
#> 4 AX 248 ALA <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 5 BE 56 BEL FRA DEU LUX NLD <NA> <NA> <NA> <NA>
Created on 2018-05-08 by the reprex package (v0.2.0).
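As a side note on why the do.call('rbind', ...) attempt in the question misbehaves: rbind() recycles the shorter vectors to the length of the longest one instead of filling them with NA. Below is a minimal base R sketch that pads every split result to the same length first. The countries data frame is re-created here by hand from the question's table, and the names pieces, padded, border_cols and result are just illustrative.

```r
# Sample data re-typed from the question (an assumption about how it was stored)
countries <- data.frame(
  cca2 = c("AX", "AL", "AD", "AT", "BE"),
  ccn3 = c(248L, 8L, 20L, 40L, 56L),
  cca3 = c("ALA", "ALB", "AND", "AUT", "BEL"),
  borders = c(NA, "MNE,GRC,MKD,UNK", "FRA,ESP",
              "CZE,DEU,HUN,ITA,LIE,SVK,SVN,CHE", "FRA,DEU,LUX,NLD"),
  stringsAsFactors = FALSE
)

# Split each borders string on commas; an NA input stays a single NA
pieces <- strsplit(as.character(countries$borders), ",", fixed = TRUE)

# Pad every vector with NA up to the longest one, so rbind() has nothing to recycle
n_max <- max(lengths(pieces))
padded <- lapply(pieces, `length<-`, n_max)

# Bind the padded vectors into a matrix, name the columns b1..b8, and reattach the id columns
border_cols <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(border_cols) <- paste0("b", seq_len(n_max))
result <- cbind(countries[c("cca2", "ccn3", "cca3")], border_cols)
result
```

This needs no packages, at the cost of doing the padding and column naming by hand.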
Another option is cSplit() from the splitstackshape package.
library(splitstackshape)
df <- cSplit(indt = df, splitCols = "borders", sep = ",", direction = "wide")
names(df) <- c(names(df)[1:3], paste0("b", 1:8)) #optional
df
# cca2 ccn3 cca3 b1 b2 b3 b4 b5 b6 b7 b8
#1: AX 248 ALA <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#2: AL 8 ALB MNE GRC MKD UNK <NA> <NA> <NA> <NA>
#3: AD 20 AND FRA ESP <NA> <NA> <NA> <NA> <NA> <NA>
#4: AT 40 AUT CZE DEU HUN ITA LIE SVK SVN CHE
#5: BE 56 BEL FRA DEU LUX NLD <NA> <NA> <NA> <NA>
data
df <- structure(list(cca2 = structure(c(4L, 2L, 1L, 3L, 5L), .Label = c("AD",
"AL", "AT", "AX", "BE"), class = "factor"), ccn3 = c(248L, 8L,
20L, 40L, 56L), cca3 = structure(1:5, .Label = c("ALA", "ALB",
"AND", "AUT", "BEL"), class = "factor"), borders = structure(c(NA,
4L, 3L, 1L, 2L), .Label = c("CZE,DEU,HUN,ITA,LIE,SVK,SVN,CHE",
"FRA,DEU,LUX,NLD", "FRA,ESP", "MNE,GRC,MKD,UNK"), class = "factor")), .Names = c("cca2",
"ccn3", "cca3", "borders"), class = "data.frame", row.names = c(NA,
-5L))
Here is one more way similar to camille's, but using separate_rows() from tidyr, which is similar to unnest() but for delimited strings, like in this case. This means we can avoid using str_split() and then unnest(). We can then create column names and spread() in much the same way.
library(tidyverse)
df <- read_table2(
"cca2 ccn3 cca3 borders
AX 248 ALA
AL 8 ALB MNE,GRC,MKD,UNK
AD 20 AND FRA,ESP
AT 40 AUT CZE,DEU,HUN,ITA,LIE,SVK,SVN,CHE
BE 56 BEL FRA,DEU,LUX,NLD"
)
df %>%
  separate_rows(borders, sep = ",") %>%
  group_by(cca2) %>%
  mutate(b = row_number()) %>%
  spread(b, borders, sep = "")
#> # A tibble: 5 x 11
#> # Groups: cca2 [5]
#> cca2 ccn3 cca3 b1 b2 b3 b4 b5 b6 b7 b8
#> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 AD 20 AND FRA ESP <NA> <NA> <NA> <NA> <NA> <NA>
#> 2 AL 8 ALB MNE GRC MKD UNK <NA> <NA> <NA> <NA>
#> 3 AT 40 AUT CZE DEU HUN ITA LIE SVK SVN CHE
#> 4 AX 248 ALA <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 5 BE 56 BEL FRA DEU LUX NLD <NA> <NA> <NA> <NA>
Created on 2018-05-08 by the reprex package (v0.2.0).
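Since tidyr 1.0.0, spread() has been superseded by pivot_wider(), so on a current install the same long-then-wide idea might be written as below. This is only a sketch of the approach above with the newer function swapped in, not something from the original answers. The tibble is re-typed from the question, and col and wide are illustrative names.

```r
library(dplyr)
library(tidyr)

# Sample data re-typed from the question (borders kept as character, missing value as NA)
df <- tibble(
  cca2 = c("AX", "AL", "AD", "AT", "BE"),
  ccn3 = c(248L, 8L, 20L, 40L, 56L),
  cca3 = c("ALA", "ALB", "AND", "AUT", "BEL"),
  borders = c(NA, "MNE,GRC,MKD,UNK", "FRA,ESP",
              "CZE,DEU,HUN,ITA,LIE,SVK,SVN,CHE", "FRA,DEU,LUX,NLD")
)

wide <- df %>%
  # One row per border; a row whose borders is NA is kept as a single NA row
  separate_rows(borders, sep = ",") %>%
  # Number the borders within each country to build the column names b1, b2, ...
  group_by(cca2) %>%
  mutate(col = paste0("b", row_number())) %>%
  ungroup() %>%
  # Back to wide format; combinations that do not occur are filled with NA
  pivot_wider(names_from = col, values_from = borders)

wide
```

Unlike the spread() version, this drops the grouping with ungroup() before reshaping, so the result is a plain ungrouped tibble.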