Converting all-uppercase strings to camel case

Question

I have a database with many thousands of tables and columns. The column names are consistently in all caps, e.g. BOOKINGPROCESSNOTEADDED, BOOKEDDATETIME, BOOKINGSTATUS. I wish to rename the columns so that they are in lower snake case, e.g. BOOKINGSTATUS -> booking_status.

Because the names contain no case changes, spaces or underscores between words, it's essentially impossible to apply the more traditional methods of converting strings between cases (e.g. R's snakecase package). I was wondering whether it's possible instead to apply some sort of English-language dictionary lookup to each string and return the possible splits.

Taking the BOOKINGSTATUS example above, the candidates returned could be boo_king_status, boo_king_stat_us and booking_status. Being able to specify a minimum word length would be useful: if the minimum is set to 4 letters, only booking_status would be returned in this example (because 'boo' is only 3 letters long and 'us' only 2).

It's quite possible that a brute-force method is too computationally expensive, but I wanted to ask in case there is a reasonably efficient way to do this. A Python or R solution would be most welcome.
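
The lookup described above could be sketched in Python roughly as follows. This is only a minimal sketch under stated assumptions: `all_splits` and `toy_words` are hypothetical names, and the tiny word list stands in for a real English dictionary. Memoising on the start position means each suffix of the name is segmented only once, although the number of candidate splits can still grow quickly with a large dictionary.

```python
from functools import lru_cache

def all_splits(name, dictionary, min_len=1):
    """Return every way to split `name` into dictionary words of at least `min_len` letters."""
    name = name.lower()
    vocab = frozenset(w.lower() for w in dictionary if len(w) >= min_len)

    @lru_cache(maxsize=None)
    def splits_from(start):
        # All segmentations of name[start:], each as a tuple of words.
        if start == len(name):
            return ((),)
        found = []
        for end in range(start + min_len, len(name) + 1):
            word = name[start:end]
            if word in vocab:
                found.extend((word,) + rest for rest in splits_from(end))
        return tuple(found)

    return ["_".join(parts) for parts in splits_from(0)]

# Hypothetical toy dictionary; a real run would load a proper English word list.
toy_words = ["boo", "king", "book", "booking", "status", "stat", "us"]
print(all_splits("BOOKINGSTATUS", toy_words))
print(all_splits("BOOKINGSTATUS", toy_words, min_len=4))
```

With this toy list the unrestricted call also yields booking_stat_us alongside the three candidates mentioned above, while `min_len=4` leaves only booking_status.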

Answer 1

library(dplyr)
library(stringr)

add_spaces <- function(colnames, words) {
  for(i in seq_along(colnames)) { # seq_along() is safe even if colnames is empty
    for(j in words) {
      if(str_detect(string = colnames[i], pattern = j)) {
        colnames[i] <- str_replace(string = colnames[i], j, glue::glue("{str_to_lower(j)}_"))
      }
    }
  }
 
  colnames <- colnames %>% 
    str_remove("_+$") # Remove trailing underscores
  
  message("Characters not identified: ")
  print(str_remove_all(colnames, "[a-z_]"))
  
  invisible(colnames)
}
   
colnames <- c("BOOKINGPROCESSNOTEADDED", "BOOKEDDATETIME", "BOOKINGSTATUS") # Example names; in practice, capture them with names(<file>)

words <- c("BOOKING", "BOOKED", "PROCESS") # Create first list of words

colnames <- add_spaces(colnames, words) # Run the first iteration        

# Characters not identified: 
# [1] "NOTEADDED" "DATETIME"  "STATUS"

words <- c(words, "NOTE", "ADDED", "DATE", "TIME", "STATUS") # Augment list with missing words

colnames <- add_spaces(colnames, words) # Rerun, ... repeat as needed
    
colnames 
# [1] "booking_process_note_added" "booked_date_time"           "booking_status"
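
Since the question also welcomes Python, the same iterate-and-augment loop might look like this. It is a rough port rather than an exact equivalent of the R code above: `add_underscores` is a hypothetical name, and the word list is the one used in this answer.

```python
import re

def add_underscores(names, words):
    """Lower-case each known word found in a name and follow it with an underscore."""
    result = []
    for name in names:
        for word in words:
            # Only the first occurrence is replaced, mirroring str_replace in the R code.
            name = name.replace(word, word.lower() + "_", 1)
        result.append(re.sub(r"_+$", "", name))  # drop trailing underscores
    # Whatever is still upper case was not covered by the word list.
    leftovers = [re.sub(r"[a-z_]", "", name) for name in result]
    return result, leftovers

names = ["BOOKINGPROCESSNOTEADDED", "BOOKEDDATETIME", "BOOKINGSTATUS"]
words = ["BOOKING", "BOOKED", "PROCESS"]

names, leftovers = add_underscores(names, words)
print(leftovers)  # upper-case fragments that still need words in the list

words += ["NOTE", "ADDED", "DATE", "TIME", "STATUS"]
names, leftovers = add_underscores(names, words)
print(names)
```

Returning the leftovers rather than only printing them makes it easier to loop until the list is empty.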

Answer 2

Here's a sloppy, brute-force, imperfect attempt. It will almost certainly miss something. In fact, it's more a conversation about the process, hoping that you can build a better "dictionary".

First, a discussion about this "dictionary": ideally it should contain each word along with its plural, *ing and *ed forms. We'll be attempting to replace each word with a snake-wrapped (_word_) version, so we'll process the words in decreasing order of length. For sanity, we should probably remove too-short words (and, an, a), so let's start with stringr::words (described simply as "sample character vectors for practicing string manipulations", so not a great start).

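vec <- c("BOOKINGPROCESSNOTEADDED", "BOOKEDDATETIME", "BOOKINGSTATUS") # the example names from the question
words <- stringr::words[ order(nchar(stringr::words), decreasing = TRUE) ]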
# see words[nchar(words) < 4] for what we are removing here
words <- words[nchar(words) > 3]
Reduce(function(txt, ptn) gsub(ptn, paste0("_", ptn ,"_"), txt, perl = TRUE),
       toupper(words), init = vec)
# [1] "_BOOK_ING_PROCESS__NOTE_ADDED" "_BOOK_ED_DATE__TIME_"          "_BOOK_INGSTATUS"

That looks odd, certainly. Note that some of the words we know are in our vector are missing from stringr::words:

c("booking", "process", "status") %in% words
# [1] FALSE  TRUE FALSE

We can augment our list:

words2 <- c(words, "booking", "booked", "status")
words2 <- words2[ order(nchar(words2), decreasing = TRUE) ]
Reduce(function(txt, ptn) gsub(ptn, paste0("_", ptn ,"_"), txt, perl = TRUE),
       toupper(words2), init = vec)
# [1] "__BOOK_ING__PROCESS__NOTE_ADDED" "__BOOK_ED__DATE__TIME_"          "__BOOK_ING__STATUS_"

The issue here is that since we have both "booking" and "book", it will always double-change "BOOKING". Given my naïve start here, I don't know of an easy quick patch other than removing "book" (and "king", incidentally).

words3 <- setdiff(words2, c("book", "king"))
Reduce(function(txt, ptn) gsub(ptn, paste0("_", ptn ,"_"), txt, perl = TRUE),
       toupper(words3), init = vec)
# [1] "_BOOKING__PROCESS__NOTE_ADDED" "_BOOKED__DATE__TIME_"          "_BOOKING__STATUS_"
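
One possible way around the double-change, sketched here in Python rather than R and not part of the approach above, is to match every word in a single regular-expression pass. Because the alternation is ordered longest-first and matches never overlap, "BOOKING" is consumed before "BOOK" or "KING" can fire, so nothing needs to be removed from the list. `mark_words` and the short word list are hypothetical.

```python
import re

def mark_words(name, words):
    """Wrap each dictionary word in underscores, preferring the longest word at each position."""
    # Longest-first ordering makes the regex try "BOOKING" before "BOOK".
    ordered = sorted({w.upper() for w in words}, key=len, reverse=True)
    pattern = re.compile("|".join(map(re.escape, ordered)))
    # A single left-to-right pass: matched text is never scanned again.
    return pattern.sub(lambda m: "_" + m.group(0) + "_", name.upper())

# Hypothetical word list that deliberately keeps "book" and "king".
words = ["book", "king", "booking", "booked", "process", "note", "date", "time", "status"]
vec = ["BOOKINGPROCESSNOTEADDED", "BOOKEDDATETIME", "BOOKINGSTATUS"]
print([mark_words(v, words) for v in vec])
```

For these three names the result has the same shape as the words3 output above, even though "book" and "king" are still present.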

From here, we can remove the leading/trailing and doubled underscores.

gsub("__", "_",
     gsub("^_|_$", "", 
          Reduce(function(txt, ptn) gsub(ptn, paste0("_", ptn ,"_"), txt, perl = TRUE),
                 toupper(words3), init = vec)))
# [1] "BOOKING_PROCESS_NOTE_ADDED" "BOOKED_DATE_TIME"           "BOOKING_STATUS"
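
The result above is still upper case, whereas the question asks for lower case. A small Python sketch of the same clean-up with the lower-casing folded in might be (`tidy` is a hypothetical helper; the input strings are copied from the words3 output):

```python
import re

def tidy(marked):
    """Collapse runs of underscores, trim them from both ends, and lower-case the result."""
    return re.sub(r"_+", "_", marked).strip("_").lower()

marked = ["_BOOKING__PROCESS__NOTE_ADDED", "_BOOKED__DATE__TIME_", "_BOOKING__STATUS_"]
print([tidy(m) for m in marked])
```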

The quality is completely dependent on starting with a good dictionary. If all of your UPPERNOSPACEWORDS are well defined, then perhaps you can build it manually. (Note that some words may just self-isolate because the words around them are known: "added" is not in words3, but it is still broken out.)

Answer 3

I would build the dictionary manually:

1. Start with an empty dictionary
2. Get all names
3. Find one containing uppercase
4. Manually add words to the dictionary to split that one
5. Split all of them using the current dictionary

Repeat the last 3 steps until all words are split. For example, with the 3 names you posted, suppose you pick BOOKINGSTATUS first: the dictionary would get c("booking", "status"), and that name would then contain no uppercase. The name BOOKINGPROCESSNOTEADDED would become booking_PROCESSNOTEADDED; if you chose that one next, you'd add c("process", "note", "added") to the dictionary, and find BOOKEDDATETIME next. Now you need to decide on the words: is it c("booked", "date", "time") or c("booked", "datetime")?

And so on.
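
To see which names still need a manual decision, one option is to count how many ways the current dictionary can split each name. The Python sketch below is an illustration under assumptions, not part of the procedure above: `count_splits` is a hypothetical helper, and the dictionary deliberately contains both "date"/"time" and "datetime" to show the ambiguity.

```python
def count_splits(name, dictionary):
    """Count the ways `name` can be split entirely into dictionary words."""
    name = name.lower()
    vocab = {w.lower() for w in dictionary}
    # ways[i] = number of ways to split the first i letters of the name
    ways = [1] + [0] * len(name)
    for end in range(1, len(name) + 1):
        for start in range(end):
            if ways[start] and name[start:end] in vocab:
                ways[end] += ways[start]
    return ways[-1]

dictionary = ["booking", "status", "process", "note", "added",
              "booked", "date", "time", "datetime"]
for name in ["BOOKINGPROCESSNOTEADDED", "BOOKEDDATETIME", "BOOKINGSTATUS"]:
    n = count_splits(name, dictionary)
    label = "not yet covered" if n == 0 else "unique" if n == 1 else "ambiguous, decide manually"
    print(name, n, label)
```

A count of 0 means the dictionary still lacks a word, 1 means the split is settled, and anything larger flags a choice such as date_time versus datetime.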
