How to combine strings in rows based on position condition

It proved difficult to find search terms for this kind of question. I need to write a script that makes all combinations of the strings in each row of a data frame. Each combination should use every string once, and should only pair strings that are two steps away from each other. The first and the last column are in reality next to each other (the strings form a circle), so they cannot be combined either. The same script needs to work on data frames with different even numbers of columns; here is an example with 8.

I have only managed to do this manually for a data frame with a given number of columns, and have not found an expression that works for a data frame with any number of columns.

This is the type of data:

  Crop_1    Crop_2      Crop_3      Crop_4  Crop_5 Crop_6 Crop_7 Crop_8
1 Potato     Onion   Sugarbeet Grassclover Cabbage Potato  Wheat Carrot
2 Potato Sugarbeet Grassclover      Potato Cabbage  Onion Carrot  Wheat

The desired outcome in this case should be these 6 options:

                  Pair_1            Pair_2              Pair_3             Pair_4 Crop_1    Crop_2      Crop_3      Crop_4  Crop_5 Crop_6 Crop_7 Crop_8
1   Potato-Sugarbeet Onion-Grassclover       Cabbage-Wheat      Potato-Carrot Potato     Onion   Sugarbeet Grassclover Cabbage Potato  Wheat Carrot
2 Potato-Grassclover  Sugarbeet-Potato      Cabbage-Carrot        Onion-Wheat Potato Sugarbeet Grassclover      Potato Cabbage  Onion Carrot  Wheat
3       Potato-Wheat      Onion-Carrot   Sugarbeet-Cabbage Grassclover-Potato Potato     Onion   Sugarbeet Grassclover Cabbage Potato  Wheat Carrot
4      Potato-Carrot   Sugarbeet-Wheat Grassclover-Cabbage       Potato-Onion Potato Sugarbeet Grassclover      Potato Cabbage  Onion Carrot  Wheat
5     Potato-Cabbage      Onion-Potato     Sugarbeet-Wheat Grassclover-Carrot Potato     Onion   Sugarbeet Grassclover Cabbage Potato  Wheat Carrot
6     Potato-Cabbage   Sugarbeet-Onion  Grassclover-Carrot       Potato-Wheat Potato Sugarbeet Grassclover      Potato Cabbage  Onion Carrot  Wheat

The data frame can be retrieved here:

structure(list(Crop_1 = structure(c(1L, 1L), .Label = "Potato", class = "factor"), 
    Crop_2 = structure(1:2, .Label = c("Onion", "Sugarbeet"), class = "factor"), 
    Crop_3 = structure(2:1, .Label = c("Grassclover", "Sugarbeet"
    ), class = "factor"), Crop_4 = structure(1:2, .Label = c("Grassclover", 
    "Potato"), class = "factor"), Crop_5 = structure(c(1L, 1L
    ), .Label = "Cabbage", class = "factor"), Crop_6 = structure(2:1, .Label = c("Onion", 
    "Potato"), class = "factor"), Crop_7 = structure(2:1, .Label = c("Carrot", 
    "Wheat"), class = "factor"), Crop_8 = structure(1:2, .Label = c("Carrot", 
    "Wheat"), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))

Here's a function that does the trick. The two cases to handle are even column counts that are divisible by four and those that aren't. When the count is divisible by four, you can split the columns into groups of four and take two pairs from each group (first with third, second with fourth), as you have done. We use seq.int to get the start of each pair, and then use setdiff to get the ends. When it isn't, treat the first 6 columns specially (matching 1-4, 2-5, 3-6) and then handle the rest in groups of four.

The rest of the complexity is just making sure that the function can accept a tibble and return a tibble, since that's what nest and unnest expect.
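
The index logic described above can be looked at separately from the tibble handling. What follows is a minimal base-R sketch, not the answer's own code: `pair_indices` is a hypothetical helper name, and rejecting column counts below 4 is an assumption made here for simplicity.

```r
# Hypothetical helper: start and end column indices for n columns.
pair_indices <- function(n) {
  stopifnot(n >= 4, n %% 2 == 0)
  # Size of the special leading block: 0 when n is divisible by 4, else 6.
  lead <- if (n %% 4 == 0) 0 else 6
  # In the leading block of six, the starts are 1, 2, 3 (paired 1-4, 2-5, 3-6).
  starts <- if (lead == 6) 1:3 else integer(0)
  if (n > lead) {
    # First column of each remaining group of four; its neighbour is also a start.
    block <- seq.int(lead + 1, n, by = 4)
    starts <- sort(c(starts, block, block + 1))
  }
  # Every column that is not a start is an end.
  ends <- setdiff(seq_len(n), starts)
  list(starts = starts, ends = ends)
}

idx8 <- pair_indices(8)    # pairs 1-3, 2-4, 5-7, 6-8
idx10 <- pair_indices(10)  # pairs 1-4, 2-5, 3-6, 7-9, 8-10
```

Pairing `starts[i]` with `ends[i]` works because both vectors are in ascending order, so the i-th start always lines up with the i-th end.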

library(tidyverse)
tbl <- structure(list(Crop_1 = c("Potato", "Potato"), Crop_2 = c("Onion", "Sugarbeet"), Crop_3 = c("Sugarbeet", "Grassclover"), Crop_4 = c("Grassclover", "Potato"), Crop_5 = c("Cabbage", "Cabbage"), Crop_6 = c("Potato", "Onion"), Crop_7 = c("Wheat", "Carrot"), Crop_8 = c("Carrot", "Wheat")), class = "data.frame", row.names = c(NA, -2L))

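# Build the Pair_* columns for one row of crops (a one-row data frame or tibble).
# Assumes character columns, as in tbl above.
pair_crops <- function(crop_row) {
  crop_set <- as.character(crop_row)
  n_crops <- length(crop_set)
  if (n_crops %% 2 == 1) {
    stop("Odd number of crops!")
  } else if (n_crops %% 4 == 0) {
    # Groups of four: the first two columns of each group are the starts.
    starts <- sort(c(seq.int(1, n_crops, 4), seq.int(2, n_crops, 4)))
  } else {
    # The first six are paired 1-4, 2-5, 3-6; the remainder is handled in fours.
    # seq.int(7, n_crops, 4) fails when n_crops is 6, so skip it in that case.
    rest <- if (n_crops > 6) c(seq.int(7, n_crops, 4), seq.int(8, n_crops, 4)) else integer(0)
    starts <- sort(c(1:3, rest))
  }
  # Every column that is not a start is an end; both are in ascending order.
  ends <- setdiff(1:n_crops, starts)
  tibble(
    pair = str_c(crop_set[starts], "-", crop_set[ends]),
    name = str_c("Pair_", seq_along(starts))
  ) %>%
    spread(name, pair)
}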

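# Give each row an id, nest its crops into a one-row tibble, build the pairs,
# then unnest both list columns back into ordinary columns.
# (This is the 2019 tidyr API; tidyr >= 1.0.0 prefers nest(crop = -rowid)
# and unnest(c(crop, pairs)).)
tbl %>%
  rowid_to_column %>%
  nest(-rowid, .key = "crop") %>%
  mutate(pairs = map(crop, pair_crops)) %>%
  unnest()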
#>   rowid Crop_1    Crop_2      Crop_3      Crop_4  Crop_5 Crop_6 Crop_7
#> 1     1 Potato     Onion   Sugarbeet Grassclover Cabbage Potato  Wheat
#> 2     2 Potato Sugarbeet Grassclover      Potato Cabbage  Onion Carrot
#>   Crop_8             Pair_1            Pair_2         Pair_3        Pair_4
#> 1 Carrot   Potato-Sugarbeet Onion-Grassclover  Cabbage-Wheat Potato-Carrot
#> 2  Wheat Potato-Grassclover  Sugarbeet-Potato Cabbage-Carrot   Onion-Wheat

Created on 2019-04-19 by the reprex package (v0.2.1)
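
The output above matches rows 1 and 2 of the desired table, i.e. one pairing per input row. The other four rows use two further pairings of the same eight columns. The sketch below reproduces all six rows in base R under a loud assumption: the three pairings in `pairings_8` are hard-coded from the desired output for 8 columns and are not derived from a general rule. `pairings_8` and `make_pairs` are hypothetical names.

```r
# Rebuild the example data with plain character columns.
tbl <- data.frame(
  Crop_1 = c("Potato", "Potato"),
  Crop_2 = c("Onion", "Sugarbeet"),
  Crop_3 = c("Sugarbeet", "Grassclover"),
  Crop_4 = c("Grassclover", "Potato"),
  Crop_5 = c("Cabbage", "Cabbage"),
  Crop_6 = c("Potato", "Onion"),
  Crop_7 = c("Wheat", "Carrot"),
  Crop_8 = c("Carrot", "Wheat"),
  stringsAsFactors = FALSE
)

# Assumption: pairings read off the desired-output table for 8 columns.
pairings_8 <- list(
  list(starts = c(1, 2, 5, 6), ends = c(3, 4, 7, 8)),
  list(starts = c(1, 2, 3, 4), ends = c(7, 8, 5, 6)),
  list(starts = c(1, 2, 3, 4), ends = c(5, 6, 7, 8))
)

# Paste each start column to its end column for every row of df.
make_pairs <- function(df, starts, ends) {
  pairs <- mapply(
    function(s, e) paste(df[[s]], df[[e]], sep = "-"),
    starts, ends
  )
  pairs <- matrix(pairs, nrow = nrow(df))  # stay a matrix even for one row
  colnames(pairs) <- paste0("Pair_", seq_along(starts))
  cbind(as.data.frame(pairs, stringsAsFactors = FALSE), df)
}

# Stack the result of each pairing: 3 pairings x 2 rows = 6 rows.
all_options <- do.call(
  rbind,
  lapply(pairings_8, function(p) make_pairs(tbl, p$starts, p$ends))
)
```

For other column counts the list of pairings would have to be supplied or generated separately; the sketch only shows how one set of index pairs turns into the Pair_* columns.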
