Idiomatic R for splitting a column into lists/vectors of irregular length, in a data frame or equivalent?

I'm struggling to implement the following requirement in R:

For example, for the following dataframe:

Multiple_rows <- data.frame(rbind(c("FLASH, SWAP.", "Memory: FLASH"), c("FLASH, , ,, SWAP.", "Memory: FLASH")))
colnames(Multiple_rows)<- c("VARIANTS", "STANDARD")
Multiple_rows
#           VARIANTS      STANDARD
#1      FLASH, SWAP. Memory: FLASH
#2 FLASH, , ,, SWAP. Memory: FLASH

The requirement is:

  1. For each row, split the value in column VARIANTS on the separator ','.
  2. For each element of the resulting list of strings, trim the leading and trailing whitespace, then filter out the blank elements.
  3. For each element of the cleaned list, create a row with two columns: STANDARD, holding the original value from the row being processed, and VARIANT, holding the element in question.
  4. Collect all the newly created rows into a new table/data frame.

So for the above example, I'd like to have the following as the result:

#  VARIANT STANDARD
#1 "FLASH" "Memory: FLASH"
#2 "SWAP." "Memory: FLASH"
#3 "FLASH" "Memory: FLASH"
#4 "SWAP." "Memory: FLASH"

The order of the rows does not matter.

Below is my implementation in Clojure (to illustrate my requirement):

(def Multiple-rows
  [{:VARIANTS "FLASH, SWAP.", :STANDARD "Memory: FLASH"}
   {:VARIANTS "FLASH, , ,, SWAP.", :STANDARD "Memory: FLASH"}]) ;; This is my input, equivalent to a data frame with 2 columns, "STANDARD" and "VARIANTS"

(defn variants-decomposed [a_map_raw] ;; process each row of the input data
  (if-let [variants (:VARIANTS a_map_raw)]
    (if (clojure.string/blank? variants)
      [{:STANDARD (:STANDARD a_map_raw), :VARIANT nil}]
      (let [standard (:STANDARD a_map_raw)
            splitted (->> (clojure.string/split variants #"[,]")
                          (map clojure.string/trim)          ;; trim each piece
                          (remove clojure.string/blank?))]   ;; drop the blank pieces
        (if (seq splitted) ;; not empty
          (for [v splitted] {:STANDARD standard, :VARIANT v})
          [{:STANDARD standard, :VARIANT nil}])))
    [{:VARIANT nil, :STANDARD (:STANDARD a_map_raw)}]))

(defn multiple-variant-maps [map_of_variants] ;; process each row and concatenate the results
  (mapcat variants-decomposed map_of_variants))

(multiple-variant-maps Multiple-rows) ;; This is my required result, equivalent to a data frame with 2 columns, "STANDARD" and "VARIANT".

Here are the results of the above computation:

({:STANDARD "Memory: FLASH", :VARIANT "FLASH"} 
{:STANDARD "Memory: FLASH", :VARIANT "SWAP."} 
{:STANDARD "Memory: FLASH", :VARIANT "FLASH"} 
{:STANDARD "Memory: FLASH", :VARIANT "SWAP."})

I would like to do the equivalent idiomatically in R. So far I have only managed the following, which still does not handle irregularities such as blank variants.

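library(stringr)   # str_trim()
library(reshape2)  # melt(); assumed to be the reshape2 version

dictionary.cleaned <- function(t) {
    # split VARIANTS on commas and bind the pieces into columns (short rows get recycled)
    variants.splitted <- sapply(data.frame(do.call('rbind', strsplit(t[, "VARIANTS"], "[,]"))), str_trim)
    # reshape to long format: one row per (STANDARD, piece)
    melted <- melt(data.frame(dplyr::select(t, -VARIANTS), variants.splitted), id.vars = "STANDARD")

    colnames(melted)[colnames(melted) == "value"] <- "VARIANT"
    melted
}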

Here is the result of the above R code:

> dictionary.cleaned(Multiple_rows)
        STANDARD variable VARIANT
1  Memory: FLASH       X1   FLASH
2  Memory: FLASH       X1   FLASH
3  Memory: FLASH       X2   SWAP.
4  Memory: FLASH       X2        
5  Memory: FLASH       X3   FLASH
6  Memory: FLASH       X3        
7  Memory: FLASH       X4   SWAP.
8  Memory: FLASH       X4        
9  Memory: FLASH       X5   FLASH
10 Memory: FLASH       X5   SWAP.

I'd like to learn to work more fluently with lists/vectors in R: its equivalents of list comprehension and list concatenation, and how to convert a list expression into a data frame properly.

Or perhaps I need to learn R's own paradigm for dealing with data of this complexity or irregularity. (R is very elegant when dealing with neatly structured vectors.)

Or maybe I should use the right tool for the job, and this kind of lower-level data wrangling is not a good fit for R?

Thanks for your help or pointers!

Yu

Here are two alternatives to consider.

The first uses cSplit from my "splitstackshape" package. It returns a data.table:

library(splitstackshape)
cSplit(Multiple_rows, "VARIANTS", ",", "long")[VARIANTS != ""]
#    VARIANTS      STANDARD
# 1:    FLASH Memory: FLASH
# 2:    SWAP. Memory: FLASH
# 3:    FLASH Memory: FLASH
# 4:    SWAP. Memory: FLASH

The second uses "dplyr" and "tidyr", with "stringi" loaded for trimming the strings:

library(dplyr)
library(tidyr)
library(stringi)

Multiple_rows %>%
  mutate(VARIANTS = lapply(strsplit(as.character(VARIANTS), ","), stri_trim)) %>%
  unnest(VARIANTS) %>%
  filter(VARIANTS != "")
#   VARIANTS      STANDARD
# 1    FLASH Memory: FLASH
# 2    SWAP. Memory: FLASH
# 3    FLASH Memory: FLASH
# 4    SWAP. Memory: FLASH
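
Note that filter(VARIANTS != "") drops a row entirely when all of its variants are blank, whereas the Clojure code in the question keeps such a row with a nil VARIANT. If that behaviour matters, here is a minimal base R sketch that keeps those rows as NA. The helper name expand_row and the third input row are hypothetical, added only to exercise the blank case:

```r
# Input from the question, plus one hypothetical all-blank row.
Multiple_rows <- data.frame(
  VARIANTS = c("FLASH, SWAP.", "FLASH, , ,, SWAP.", " , "),
  STANDARD = c("Memory: FLASH", "Memory: FLASH", "Memory: SWAP"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn one input row into one output row per
# non-blank variant, or a single NA row if no variant survives.
expand_row <- function(variants, standard) {
  parts <- trimws(strsplit(variants, ",", fixed = TRUE)[[1]])
  parts <- parts[nzchar(parts)]                    # drop blank pieces
  if (length(parts) == 0) parts <- NA_character_   # keep the row, like nil in Clojure
  data.frame(VARIANT = parts, STANDARD = standard, stringsAsFactors = FALSE)
}

# Map() plays the role of the list comprehension,
# do.call(rbind, ...) the role of the concatenation.
rows <- Map(expand_row, Multiple_rows$VARIANTS, Multiple_rows$STANDARD)
result <- do.call(rbind, unname(rows))
rownames(result) <- NULL
result
```

Building one small data frame per row is slower than a vectorised approach on large inputs, but it mirrors the map-then-concatenate shape of the Clojure version closely.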

Try this:

# remove all spaces from the VARIANTS column
Multiple_rows$VARIANTS <- gsub(" ", "", as.character(Multiple_rows$VARIANTS))
# replace repeated commas with a single comma
Multiple_rows$VARIANTS <- gsub(",+", ",", as.character(Multiple_rows$VARIANTS))

VARIANTS <- unlist(strsplit(Multiple_rows$VARIANTS, ","))
STANDARD <- rep(Multiple_rows$STANDARD, 
                sapply(strsplit(Multiple_rows$VARIANTS, ","), length))

Multiple_rows <- data.frame(VARIANTS, STANDARD)
#  VARIANTS      STANDARD
#1    FLASH Memory: FLASH
#2    SWAP. Memory: FLASH
#3    FLASH Memory: FLASH
#4    SWAP. Memory: FLASH
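
One caveat with the approach above: gsub(" ", "", ...) removes every space, including spaces inside a variant name, and a value that starts with a comma would still produce an empty first element after strsplit. Here is a sketch of a variation that trims only leading and trailing whitespace and then drops the blank pieces, using only base R. The names rows_in, pieces and rows_out, and the third input row, are made up for illustration:

```r
# Input from the question, plus one hypothetical row with a leading comma
# and a variant that contains an inner space.
rows_in <- data.frame(
  VARIANTS = c("FLASH, SWAP.", "FLASH, , ,, SWAP.", ", FLASH DRIVE"),
  STANDARD = c("Memory: FLASH", "Memory: FLASH", "Memory: FLASH"),
  stringsAsFactors = FALSE
)

# Split on commas, trim each piece, then drop the blank pieces.
pieces <- lapply(strsplit(rows_in$VARIANTS, ",", fixed = TRUE), trimws)
pieces <- lapply(pieces, function(p) p[nzchar(p)])

# Repeat each STANDARD once per surviving piece of its row.
rows_out <- data.frame(
  VARIANT  = unlist(pieces),
  STANDARD = rep(rows_in$STANDARD, lengths(pieces)),
  stringsAsFactors = FALSE
)
rows_out
```

As in the original answer, a row whose variants are all blank contributes no output rows.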
