Extracting numerical values from the column names of a data.frame

Question

I have data as follows:

library(magrittr)
dat_I <- structure(list(`[0,25)` = c(0L, 2L, 252L, 3L, 34L, 0L, 2L, 65L, 
23L, 9L, 84L, 24L, 52L, 5L, 1L, 91L, 5L, 4L, 7L, 5L, 40L, 116L, 
12L), `[1000,1500)` = c(0L, 12L, 16L, 0L, 34L, 1L, 0L, 7L, 0L, 
0L, 2L, 0L, 4L, 11L, 1L, 0L, 0L, 6L, 8L, 0L, 2L, 8L, 0L), `[1500,1000000)` = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0), `[1500,3000)` = c(8L, 5L, 8L, 0L, 16L, 2L, 10L, 4L, 5L, 0L, 
4L, 3L, 0L, 6L, 4L, 0L, 49L, 7L, 6L, 0L, 1L, 2L, 0L), `[25,1000)` = c(0L, 
22L, 48L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 25L, 27L, 0L, 0L, 28L, 0L), `[25,1500)` = c(15L, 0L, 0L, 
0L, 0L, 0L, 23L, 0L, 23L, 0L, 0L, 25L, 0L, 0L, 0L, 0L, 5L, 0L, 
0L, 0L, 0L, 0L, 0L), `[25,250)` = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 42L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L), `[25,3000)` = c(0L, 0L, 0L, 33L, 0L, 0L, 0L, 0L, 0L, 63L, 
0L, 0L, 0L, 0L, 0L, 29L, 0L, 0L, 0L, 34L, 0L, 0L, 83L), `[25,500)` = c(0L, 
0L, 0L, 0L, 213L, 24L, 0L, 23L, 0L, 0L, 25L, 0L, 21L, 107L, 0L, 
0L, 0L, 0L, 0L, 0L, 23L, 0L, 0L), `[250,500)` = c(0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L), `[3000,1000000)` = c(2L, 1L, 1L, 7L, 1L, 0L, 
2L, 1L, 5L, 25L, 5L, 1L, 0L, 3L, 0L, 4L, 7L, 2L, 5L, 17L, 0L, 
5L, 19L), `[500,1000)` = c(0L, 0L, 0L, 0L, 122L, 9L, 0L, 11L, 
0L, 0L, 7L, 0L, 6L, 44L, 3L, 0L, 0L, 0L, 0L, 0L, 7L, 0L, 0L)), class = "data.frame", row.names = c("A", 
"B", "C", "D", 
"E", "F", "G", 
"H", "I", "J", "K", 
"L", "M", "N", 
"O", "P", "Q", 
"R", "S", "T", "U", 
"V", "W"))

dat_II <- structure(list(`[0,25)` = 5L, `[100,250)` = 43L, `[100,500)` = 0L, 
    `[1000,1000000]` = 20L, `[1000,1500)` = 0L, `[1500,3000)` = 0L, 
    `[25,100)` = 38L, `[25,50)` = 0L, `[250,500)` = 27L, `[3000,1000000]` = 0L, 
    `[50,100)` = 0L, `[500,1000)` = 44L, `[500,1000000]` = 0L), row.names = "Type_A", class = "data.frame")

I would like to apply the following code:

s_ordered_II <- stringi::stri_extract_all_regex(colnames(dat_II), "[[:alpha:]]+") %>%
  unlist() %>% 
  unique() %>% 
  sort()

s_ordered_I <- stringi::stri_extract_all_regex(colnames(dat_I), "[[:alpha:]]+") %>%
  unlist() %>% 
  unique() %>% 
  sort()

For some reason it does not work, although similar code worked before. I do not understand why.

Could someone comment?

Answer 1

You're using "[[:alpha:]]+", which matches alphabetic characters (a combination of [:lower:] and [:upper:]). Your column names contain no letters, so nothing matches. If you want numbers, you should be using "[[:digit:]]+" instead (or "[[:alnum:]]+", which matches both letters and digits). See ?regex for the full list of classes; the two relevant ones are:

     '[:alpha:]' Alphabetic characters: '[:lower:]' and '[:upper:]'.
     '[:digit:]' Digits: '0 1 2 3 4 5 6 7 8 9'.
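
To see the difference between the two classes on names like these, here is a minimal sketch. It uses base R's regmatches() and gregexpr() instead of stringi so that it runs without extra packages, and nms is just an illustrative subset of the column names of dat_II.

```r
# Illustrative subset of the column names from dat_II (hypothetical variable name).
nms <- c("[0,25)", "[25,100)", "[1000,1000000]")

# [[:alpha:]] looks for letters; these names contain none, so nothing is extracted.
alpha_hits <- unlist(regmatches(nms, gregexpr("[[:alpha:]]+", nms)))

# [[:digit:]] looks for runs of digits, which is what the names are made of.
digit_hits <- unlist(regmatches(nms, gregexpr("[[:digit:]]+", nms)))

length(alpha_hits)  # 0
digit_hits          # "0" "25" "25" "100" "1000" "1000000"
```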

With that,

stringi::stri_extract_all_regex(colnames(dat_II), "[[:digit:]]+") %>%
  unlist() %>% 
  unique() %>% 
  sort()
#  [1] "0"       "100"     "1000"    "1000000" "1500"    "25"      "250"     "3000"    "50"      "500"    

stringi::stri_extract_all_regex(colnames(dat_I), "[[:digit:]]+") %>%
  unlist() %>% 
  unique() %>% 
  sort()
# [1] "0"       "1000"    "1000000" "1500"    "25"      "250"     "3000"    "500"

Though that does lose the pairing of (say) [0,25) ...
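
If the pairing matters, one possible approach (a sketch, not necessarily what you need) is to keep the two numbers of each name side by side and convert them to numeric, so that they also sort by value rather than alphabetically. The names nms, parts, bounds and breaks are made up for this example, and it assumes every column name contains exactly two runs of digits, as in dat_II. It uses base R only.

```r
# Column names copied from dat_II in the question.
nms <- c("[0,25)", "[100,250)", "[100,500)", "[1000,1000000]", "[1000,1500)",
         "[1500,3000)", "[25,100)", "[25,50)", "[250,500)", "[3000,1000000]",
         "[50,100)", "[500,1000)", "[500,1000000]")

# Extract every run of digits per name; each name is assumed to yield two.
parts <- regmatches(nms, gregexpr("[[:digit:]]+", nms))

# One row per column name, keeping lower and upper bound together as numbers.
bounds <- data.frame(
  name  = nms,
  lower = as.numeric(vapply(parts, `[`, character(1), 1)),
  upper = as.numeric(vapply(parts, `[`, character(1), 2)),
  stringsAsFactors = FALSE
)

# Order the intervals by their numeric bounds, not by their text.
bounds <- bounds[order(bounds$lower, bounds$upper), ]

# Distinct break points, sorted as numbers.
breaks <- sort(unique(c(bounds$lower, bounds$upper)))
breaks
```

The same should work for dat_I by replacing nms with colnames(dat_I).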
