How to automate hierarchical grouping of variables based on variable name

Question

I have my variables named in little-endian fashion, separated by periods.

I'd like to create index variables for each level and get summary output for the variables at each level, but I'm stuck at the first step: breaking the variable names apart and putting the pieces in a table I can work with.

Variable naming convention:

Environment.Construct.Subconstruct_1.subconstruct_i.#.Short_Name

Example:

n <- 6
dat <- data.frame(
  ph1.career_interest.delight.1.Friendly=sample(1:5, n, replace=TRUE),
  ph1.career_interest.delight.2.Advantagious=sample(1:5, n, replace=TRUE),
  ph1.career_interest.philosophy.1.Meaningful_Difference=sample(1:5, n, replace=TRUE),
  ph1.career_interest.philosophy.2.Enable_Work=sample(1:5, n, replace=TRUE)
)

# create list of variable names
names <-  as.list(colnames( dat ))
## Try to create a hierarchy of variables: Step 1: Create matrix
heir <- as.matrix(strsplit(names,".", fixed = TRUE))

I've gone through a couple iterations but it still returns an error:

Error in strsplit(names, ".", fixed = TRUE) : non-character argument

Answer 1

Instead of wrapping the names with as.list, pass colnames(dat) to strsplit directly. According to ?strsplit, the input x is:

x - character vector, each element of which is to be split. Other inputs, including a factor, will give an error.

Thus, a list is not the input class that strsplit expects, which is why the call fails with "non-character argument".

nm1 <- colnames(dat)
strsplit(nm1, ".", fixed = TRUE)
#[[1]]
#[1] "ph1"             "career_interest" "delight"         "1"               "Friendly"       

#[[2]]
#[1] "ph1"             "career_interest" "delight"         "2"               "Advantagious"   

#[[3]]
#[1] "ph1"                   "career_interest"       "philosophy"            "1"                     "Meaningful_Difference"

#[[4]]
#[1] "ph1"             "career_interest" "philosophy"      "2"               "Enable_Work"

The output is a list of character vectors. The expected output format is not clear from the question. If a matrix or data.frame is needed, the list elements can be combined with rbind (assuming they all have the same length):

m1 <- do.call(rbind, strsplit(nm1, ".", fixed = TRUE))

This returns a character matrix with one row per variable name.

Alternatively, it can be converted to a data.frame with rbind.data.frame.
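
As a rough illustration of the data.frame route, the sketch below goes through the matrix and as.data.frame rather than rbind.data.frame. The column labels (environment, construct, subconstruct, item, short_name) are assumptions based on the naming convention in the question, not something this answer specifies.

```r
# Column names as in the question's example
nm1 <- c(
  "ph1.career_interest.delight.1.Friendly",
  "ph1.career_interest.delight.2.Advantagious",
  "ph1.career_interest.philosophy.1.Meaningful_Difference",
  "ph1.career_interest.philosophy.2.Enable_Work"
)

# Split on a literal "." and stack the pieces into a character matrix
m1 <- do.call(rbind, strsplit(nm1, ".", fixed = TRUE))

# Convert to a data.frame; the level labels below are hypothetical
df1 <- as.data.frame(m1, stringsAsFactors = FALSE)
names(df1) <- c("environment", "construct", "subconstruct", "item", "short_name")

# Keep the original variable name next to its parts for later lookups
df1$variable <- nm1
```

Keeping the full name in its own column makes it easy to go from a level (say, one subconstruct) back to the matching columns of dat.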

NOTE: names is also the name of a base R function. It is better not to reuse function names as object names.

Update

If the lengths are not the same, one option is to pad the shorter elements with NA at the end:

lst1 <- strsplit(nm1, ".", fixed = TRUE)
lst1[[1]] <- lst1[[1]][1:3] # making lengths different
mx  <- max(lengths(lst1))
do.call(rbind, lapply(lst1, `length<-`, mx))
#     [,1]  [,2]              [,3]         [,4] [,5]
#[1,] "ph1" "career_interest" "delight"    NA   NA                     
#[2,] "ph1" "career_interest" "delight"    "2"  "Advantagious"         
#[3,] "ph1" "career_interest" "philosophy" "1"  "Meaningful_Difference"
#[4,] "ph1" "career_interest" "philosophy" "2"  "Enable_Work"
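
For the asker's eventual goal (an index per level), the split names can serve as a grouping key. The following is only a minimal sketch, assuming every name has the same number of levels and that a row mean is an acceptable index. The objects parts, key, and index, the fixed seed, and the choice of the first three levels are all illustrative and not part of the original answer.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
n <- 6
dat <- data.frame(
  ph1.career_interest.delight.1.Friendly = sample(1:5, n, replace = TRUE),
  ph1.career_interest.delight.2.Advantagious = sample(1:5, n, replace = TRUE),
  ph1.career_interest.philosophy.1.Meaningful_Difference = sample(1:5, n, replace = TRUE),
  ph1.career_interest.philosophy.2.Enable_Work = sample(1:5, n, replace = TRUE)
)

# One row per variable, one column per level of the name
parts <- do.call(rbind, strsplit(colnames(dat), ".", fixed = TRUE))

# Grouping key: the first three levels, joined back with "."
key <- apply(parts[, 1:3, drop = FALSE], 1, paste, collapse = ".")

# For each group of columns, take the row mean as a simple index
index <- sapply(split(colnames(dat), key), function(cols) {
  rowMeans(dat[, cols, drop = FALSE])
})

# Summary output per index
summary(index)
```

Changing 1:3 to 1:2 would group at the construct level instead, so the same pattern can be repeated for each level of the hierarchy.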

Answer 2

You can count the number of '.' characters in the column names to work out how many new columns to create. tidyr::separate can then split the names into n new columns on '.'; with fill = 'right', names with fewer pieces are padded with NA on the right.

#Changing 1st column name to make length unequal
names(dat)[1] <- 'ph1.career_interest.delight.1'
#Number of new columns to be created
n <- max(stringr::str_count(names(dat), '\\.')) + 1
tidyr::separate(data.frame(name = names(dat)), name, 
                paste0('col', seq_len(n)), sep = '\\.', fill = 'right')

#  col1            col2       col3 col4                  col5
#1  ph1 career_interest    delight    1                  <NA>
#2  ph1 career_interest    delight    2          Advantagious
#3  ph1 career_interest philosophy    1 Meaningful_Difference
#4  ph1 career_interest philosophy    2           Enable_Work
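
If stringr is not available, the same column count can be obtained in base R. This is a sketch only; nms, n_dots, n_cols, and n_cols_alt are illustrative names.

```r
# Names as modified above, so the first one has fewer levels
nms <- c(
  "ph1.career_interest.delight.1",
  "ph1.career_interest.delight.2.Advantagious",
  "ph1.career_interest.philosophy.1.Meaningful_Difference",
  "ph1.career_interest.philosophy.2.Enable_Work"
)

# Count literal "." characters in each name
n_dots <- lengths(regmatches(nms, gregexpr(".", nms, fixed = TRUE)))

# Number of pieces is one more than the largest number of separators
n_cols <- max(n_dots) + 1L

# Alternative: split first, then take the largest number of pieces
# (gives the same result for these names)
n_cols_alt <- max(lengths(strsplit(nms, ".", fixed = TRUE)))
```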
