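I'm struggling to implement the following requirement in R: split a comma-separated column into one row per value, trimming whitespace and dropping blank entries.
For example, take the following data frame: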
Multiple_rows <- data.frame(rbind(c("FLASH, SWAP.", "Memory: FLASH"), c("FLASH, , ,, SWAP.", "Memory: FLASH")))
colnames(Multiple_rows)<- c("VARIANTS", "STANDARD")
Multiple_rows
# VARIANTS STANDARD
#1 FLASH, SWAP. Memory: FLASH
#2 FLASH, , ,, SWAP. Memory: FLASH
So for the above example, I'd like to have the following as the result:
#   VARIANT  STANDARD
#1  "FLASH"  "Memory: FLASH"
#2  "SWAP."  "Memory: FLASH"
#3  "FLASH"  "Memory: FLASH"
#4  "SWAP."  "Memory: FLASH"
The order of the rows does not matter.
Below is my implementation in Clojure (to illustrate my requirement):
(def Multiple-rows
  [{:VARIANTS "FLASH, SWAP.", :STANDARD "Memory: FLASH"}
   {:VARIANTS "FLASH, , ,, SWAP.", :STANDARD "Memory: FLASH"}]) ;; This is my input, equivalent to a data frame with 2 columns, "STANDARD" and "VARIANTS"

(defn variants-decomposed [a_map_raw] ;; process one row of the input data
  (if-let [variants (:VARIANTS a_map_raw)]
    (if (clojure.string/blank? variants)
      [{:STANDARD (:STANDARD a_map_raw), :VARIANT nil}]
      (let [standard (:STANDARD a_map_raw)
            ;; split on commas, trim each piece, drop the blank ones
            splitted (->> (clojure.string/split variants #"[,]")
                          (map clojure.string/trim)
                          (remove clojure.string/blank?))]
        (if (seq splitted) ;; not empty
          (for [v splitted] {:STANDARD standard, :VARIANT v})
          [{:STANDARD standard, :VARIANT nil}])))
    [{:VARIANT nil, :STANDARD (:STANDARD a_map_raw)}]))

(defn multiple-variant-maps [map_of_variants] ;; process every row and concatenate the results
  (mapcat variants-decomposed map_of_variants))

(multiple-variant-maps Multiple-rows) ;; This is my required result, equivalent to a data frame with 2 columns, "STANDARD" and "VARIANT"
Here are the results of the above computation:
({:STANDARD "Memory: FLASH", :VARIANT "FLASH"}
{:STANDARD "Memory: FLASH", :VARIANT "SWAP."}
{:STANDARD "Memory: FLASH", :VARIANT "FLASH"}
{:STANDARD "Memory: FLASH", :VARIANT "SWAP."})
I would like to do the equivalent idiomatically in R. So far I have only managed the following, which still does not handle irregularities such as blank variants:
library(stringr)   # provides str_trim
library(reshape2)  # assumed source of melt
dictionary.cleaned <- function(t) {
variants.splitted <- sapply(data.frame(do.call('rbind', strsplit(t[, "VARIANTS"], "[,]"))), str_trim)
melted <- melt(data.frame(dplyr::select(t, -VARIANTS), variants.splitted), id.vars = "STANDARD")
colnames(melted)[colnames(melted)== "value"] <- "VARIANT"
melted
}
Here is the result of the above R code:
> dictionary.cleaned(Multiple_rows)
STANDARD variable VARIANT
1 Memory: FLASH X1 FLASH
2 Memory: FLASH X1 FLASH
3 Memory: FLASH X2 SWAP.
4 Memory: FLASH X2
5 Memory: FLASH X3 FLASH
6 Memory: FLASH X3
7 Memory: FLASH X4 SWAP.
8 Memory: FLASH X4
9 Memory: FLASH X5 FLASH
10 Memory: FLASH X5 SWAP.
I'd like to learn to program more fluently in R when dealing with lists and vectors: R's equivalents of list comprehension and list concatenation, and how to convert a list expression into a data frame properly.
Or perhaps I need to learn R's paradigm for handling data with this kind of complexity or irregularity. (R is very elegant when dealing with neatly structured vectors.)
Or maybe I should use the right tool for the job, and this kind of lower-level data wrangling is simply not a good fit for R?
Thanks for your help or pointers!
Yu
Here are two alternatives to consider.
The first uses cSplit from my "splitstackshape" package. It returns a data.table:
library(splitstackshape)
cSplit(Multiple_rows, "VARIANTS", ",", "long")[VARIANTS != ""]
# VARIANTS STANDARD
# 1: FLASH Memory: FLASH
# 2: SWAP. Memory: FLASH
# 3: FLASH Memory: FLASH
# 4: SWAP. Memory: FLASH
The second uses "dplyr" and "tidyr", with "stringi" loaded for trimming the strings:
library(dplyr)
library(tidyr)
library(stringi)
Multiple_rows %>%
mutate(VARIANTS = lapply(strsplit(as.character(VARIANTS), ","), stri_trim)) %>%
unnest(VARIANTS) %>%
filter(VARIANTS != "")
# VARIANTS STANDARD
# 1 FLASH Memory: FLASH
# 2 SWAP. Memory: FLASH
# 3 FLASH Memory: FLASH
# 4 SWAP. Memory: FLASH
Try this:
# remove all spaces from the VARIANTS column
Multiple_rows$VARIANTS <- gsub(" ", "", as.character(Multiple_rows$VARIANTS))
# replace repeated commas with a single comma
Multiple_rows$VARIANTS <- gsub(",+", ",", as.character(Multiple_rows$VARIANTS))
VARIANTS <- unlist(strsplit(Multiple_rows$VARIANTS, ","))
STANDARD <- rep(Multiple_rows$STANDARD,
sapply(strsplit(Multiple_rows$VARIANTS, ","), length))
Multiple_rows <- data.frame(VARIANTS, STANDARD)
# VARIANTS STANDARD
#1 FLASH Memory: FLASH
#2 SWAP. Memory: FLASH
#3 FLASH Memory: FLASH
#4 SWAP. Memory: FLASH
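For completeness, here is a minimal base R sketch that mirrors the Clojure logic from the question more closely: lapply plays the role of the list comprehension, unlist does the concatenation, and rep repeats STANDARD once per surviving variant. Unlike the answers above, it keeps one row with an NA variant when a row has no non-blank variants, matching the nil case in the Clojure code. The function name split_variants is made up for illustration, and the example rebuilds Multiple_rows so it runs on its own.

```r
# Hypothetical helper (not from any package): one output row per non-blank variant.
split_variants <- function(df) {
  # Split each VARIANTS string on commas, trim whitespace, drop blank pieces.
  pieces <- lapply(
    strsplit(as.character(df$VARIANTS), ",", fixed = TRUE),
    function(p) {
      p <- trimws(p)
      p <- p[!is.na(p) & nzchar(p)]
      # Keep a single NA when nothing is left, like the nil row in the Clojure code.
      if (length(p) == 0) NA_character_ else p
    }
  )
  data.frame(
    VARIANT  = unlist(pieces),                                   # concatenate the pieces
    STANDARD = rep(as.character(df$STANDARD), lengths(pieces)),  # repeat per variant
    stringsAsFactors = FALSE
  )
}

# Rebuild the input from the question so the example is self-contained.
Multiple_rows <- data.frame(
  VARIANTS = c("FLASH, SWAP.", "FLASH, , ,, SWAP."),
  STANDARD = c("Memory: FLASH", "Memory: FLASH"),
  stringsAsFactors = FALSE
)

result <- split_variants(Multiple_rows)
print(result)
```

This needs R 3.2 or later for trimws and lengths. If the NA rows are not wanted, result[!is.na(result$VARIANT), ] drops them.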