简体   繁体   中英

Camel case string from all caps

I have a database with many thousands of tables and columns. The column names are consistently in all caps eg BOOKINGPROCESSNOTEADDED, BOOKEDDATETIME, BOOKINGSTATUS. I wish to rename columns so they are lower camel case eg BOOKINGSTATUS -> booking_status

Because there is no variation in case between words, spaces or underscores, it's essentially impossible to apply more traditional methods to convert strings into different cases (eg using R's snakecase package). I was wondering if it's possible to instead apply some sort of English language dictionary lookup on each string and return splits.

Taking the BOOKINGSTATUS example above, a return could be: boo_king_status, boo_king_stat_us and booking_status. Specifying the minimum length of words would be useful. If the minimum is set to 4 letters, then only booking_status would be returned in this example (because 'boo' is only 3-letters long and 'us' only 2-letters long)

It's quite possible that a brute-force method is too computationally expensive, but wanted to ask in case there is a reasonably efficient method to do this. A Python or R solution would be most welcome.

library(dplyr)
library(stringr)

add_spaces <- function(colnames, words) {
  for(i in 1:length(colnames)) {
    for(j in words) {
      if(str_detect(string = colnames[i], pattern = j)) {
        colnames[i] <- str_replace(string = colnames[i], j, glue::glue("{str_to_lower(j)}_"))
      }
    }
  }
 
  colnames <- colnames %>% 
    str_remove("\\_+$") # Remove hyphens at the end
  
  message("Characters not identified: ")
  print(str_remove_all(colnames, "[a-z_]"))
  
  invisible(colnames)
}
   
colnames <- names(<file>) # Capture colnames  # Using c("BOOKINGPROCESSNOTEADDED", "BOOKEDDATETIME", "BOOKINGSTATUS")

words <- c("BOOKING", "BOOKED", "PROCESS") # Create first list of words

colnames <- add_spaces(colnames, words) # Run the first iteration        

> Characters not identified:
[1] "NOTEADDED" "DATETIME"  "STATUS"

words <- c(words, "NOTE", "ADDED", "DATE", "TIME", "STATUS") # Augment list with missing words

colnames <- add_spaces(colnames, words) # Rerun, ... repeat as needed
    
colnames 

[1] "booking_process_note_added" "booked_date_time"           "booking_status" 

Here's a sloppy, brute-force, imperfect attempt. It will almost certainly miss something. In fact, it's more a conversation about the process, hoping that you can build a better "dictionary".

First, a discussion about this "dictionary": ideally it should contain a word and its plural, *ing , and *ed forms. We'll be attempting to replace each word with a snake-wrapped ( _word_ ) version, so we'll go in reverse-order based on length. For sanity, we should probably remove too-short words ( and , an , a ), so let's start with stringr::words (simply a "sample character vectors for practicing string manipulations" , not a great start).

words <- stringr::words[ order(nchar(stringr::words), decreasing = TRUE) ]
# see words[nchar(words) < 4] for what we are removing here
words <- words[nchar(words) > 3]
Reduce(function(txt, ptn) gsub(ptn, paste0("_", ptn ,"_"), txt, perl = TRUE),
       toupper(words), init = vec)
# [1] "_BOOK_ING_PROCESS__NOTE_ADDED" "_BOOK_ED_DATE__TIME_"          "_BOOK_INGSTATUS"              

That looks odd, certainly. We can note that some of the words we know are in our vector are missing in stringr::words :

c("booking", "process", "status") %in% words
# [1] FALSE  TRUE FALSE

We can augment our list:

words2 <- c(words, "booking", "booked", "status")
words2 <- words2[ order(nchar(words2), decreasing = TRUE) ]
Reduce(function(txt, ptn) gsub(ptn, paste0("_", ptn ,"_"), txt, perl = TRUE),
       toupper(words2), init = vec)
# [1] "__BOOK_ING__PROCESS__NOTE_ADDED" "__BOOK_ED__DATE__TIME_"          "__BOOK_ING__STATUS_"            

The issue here is that since we have both "booking" and "book" , it will always double-change "BOOKING" . Given my naïve start here, I don't know that there's an easy quick-patch other than to remove "book" (and "king" , incidentally).

words3 <- setdiff(words2, c("book", "king"))
Reduce(function(txt, ptn) gsub(ptn, paste0("_", ptn ,"_"), txt, perl = TRUE),
       toupper(words3), init = vec)
# [1] "_BOOKING__PROCESS__NOTE_ADDED" "_BOOKED__DATE__TIME_"          "_BOOKING__STATUS_"            

From here, we can remove leading/trailing and double _ .

gsub("__", "_",
     gsub("^_|_$", "", 
          Reduce(function(txt, ptn) gsub(ptn, paste0("_", ptn ,"_"), txt, perl = TRUE),
                 toupper(words3), init = vec)))
# [1] "BOOKING_PROCESS_NOTE_ADDED" "BOOKED_DATE_TIME"           "BOOKING_STATUS"            

The quality is completely dependent on starting with a good dictionary. If all of your UPPERNOSPACEWORDS are well defined, then perhaps you can build it manually. (Note that some words may just self-isolate because there is a known word both before and after (note that "added" is not in words3 but it is still broken out).

I would build the dictionary manually:

  • Start with an empty dictionary
  • Get all names
  • Find one containing uppercase
  • Manually add words to the dictionary to split that one
  • Split all of them using the current dictionary

Repeat the last 3 steps until all words are split. For example, with the 3 names you posted, the dictionary would first get c("booking", "status") , and that name would have no uppercase. The name BOOKINGPROCESSNOTEADDED would become booking_PROCESSNOTEADDED ; if you chose that, you'd add c("process", "note", "added") to the dictionary, and find BOOKEDDATETIME next. Now you need to decide on the words: is it c("booked", "date", "time") or c("booked", "datetime") ?

And so on.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM