Camel case string from all caps
I have a database with many thousands of tables and columns. The column names are consistently in all caps, e.g. BOOKINGPROCESSNOTEADDED, BOOKEDDATETIME, BOOKINGSTATUS. I wish to rename the columns so they are lower case with the words separated by underscores (snake case), e.g. BOOKINGSTATUS -> booking_status.
Because the names contain no case changes, spaces or underscores between words, the usual case-conversion tools (e.g. R's snakecase package) have no word boundaries to work from. I was wondering if it's possible to instead apply some sort of English language dictionary lookup to each string and return the possible splits.
Taking the BOOKINGSTATUS example above, the returned splits could be boo_king_status, boo_king_stat_us and booking_status. Being able to specify a minimum word length would be useful: with a minimum of 4 letters, only booking_status would be returned in this example (because 'boo' is only 3 letters long and 'us' only 2).
It's quite possible that a brute-force method is too computationally expensive, but I wanted to ask in case there is a reasonably efficient way to do this. A Python or R solution would be most welcome.
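The dictionary lookup with a minimum word length described above can be sketched as a memoised "word break" search. This is only an illustration in Python: the function name split_names and the toy word list are made up here, and a real run would need a proper English word list. Note that the number of valid splits can still grow quickly for long names with a permissive dictionary.

```python
from functools import lru_cache


def split_names(name, dictionary, min_len=4):
    """Return every split of `name` into dictionary words of at least `min_len` letters."""
    text = name.lower()
    vocab = {w.lower() for w in dictionary if len(w) >= min_len}

    @lru_cache(maxsize=None)
    def splits_from(start):
        # All segmentations of text[start:], each as a tuple of words.
        if start == len(text):
            return ((),)
        found = []
        for end in range(start + min_len, len(text) + 1):
            word = text[start:end]
            if word in vocab:
                for rest in splits_from(end):
                    found.append((word,) + rest)
        return tuple(found)

    return ["_".join(parts) for parts in splits_from(0)]


# Hypothetical toy dictionary, just enough for the example in the question.
toy_words = ["boo", "king", "book", "booking", "stat", "status", "us"]

print(split_names("BOOKINGSTATUS", toy_words, min_len=4))  # ['booking_status']
print(split_names("BOOKINGSTATUS", toy_words, min_len=2))  # several candidates
```

A name that returns an empty list contains at least one stretch that the dictionary cannot cover, which is a useful signal that a word is missing.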
library(dplyr)
library(stringr)
# Lower-case every dictionary word found in the names and follow it with "_".
add_spaces <- function(colnames, words) {
  for (i in seq_along(colnames)) {
    for (j in words) {
      # str_replace_all() also handles a word that occurs more than once in a name
      colnames[i] <- str_replace_all(colnames[i], j, paste0(str_to_lower(j), "_"))
    }
  }
  colnames <- colnames %>%
    str_remove("_+$") # Remove underscores at the end
  message("Characters not identified: ")
  print(str_remove_all(colnames, "[a-z_]")) # Whatever is still upper case
  invisible(colnames)
}
colnames <- names(<file>) # Capture colnames; here c("BOOKINGPROCESSNOTEADDED", "BOOKEDDATETIME", "BOOKINGSTATUS")
words <- c("BOOKING", "BOOKED", "PROCESS") # Create first list of words
colnames <- add_spaces(colnames, words) # Run the first iteration
# Characters not identified:
# [1] "NOTEADDED" "DATETIME" "STATUS"
words <- c(words, "NOTE", "ADDED", "DATE", "TIME", "STATUS") # Augment list with missing words
colnames <- add_spaces(colnames, words) # Rerun, ... repeat as needed
colnames
# [1] "booking_process_note_added" "booked_date_time" "booking_status"
Here's a sloppy, brute-force, imperfect attempt. It will almost certainly miss something. In fact, it's more a conversation about the process, in the hope that you can build a better "dictionary".
First, a discussion about this "dictionary": ideally it should contain each word along with its plural, *ing, and *ed forms. We'll be attempting to replace each word with a snake-wrapped (_word_) version, so we'll work through the words from longest to shortest. For sanity, we should probably remove too-short words (and, an, a), so let's start with stringr::words (simply "sample character vectors for practicing string manipulations", not a great start).
# The three example names from the question
vec <- c("BOOKINGPROCESSNOTEADDED", "BOOKEDDATETIME", "BOOKINGSTATUS")
words <- stringr::words[ order(nchar(stringr::words), decreasing = TRUE) ]
# see words[nchar(words) < 4] for what we are removing here
words <- words[nchar(words) > 3]
Reduce(function(txt, ptn) gsub(ptn, paste0("_", ptn ,"_"), txt, perl = TRUE),
       toupper(words), init = vec)
# [1] "_BOOK_ING_PROCESS__NOTE_ADDED" "_BOOK_ED_DATE__TIME_" "_BOOK_INGSTATUS"
That looks odd, certainly. Note that some of the words we know are in our vector are missing from stringr::words:
c("booking", "process", "status") %in% words
# [1] FALSE TRUE FALSE
We can augment our list:
words2 <- c(words, "booking", "booked", "status")
words2 <- words2[ order(nchar(words2), decreasing = TRUE) ]
Reduce(function(txt, ptn) gsub(ptn, paste0("_", ptn ,"_"), txt, perl = TRUE),
       toupper(words2), init = vec)
# [1] "__BOOK_ING__PROCESS__NOTE_ADDED" "__BOOK_ED__DATE__TIME_" "__BOOK_ING__STATUS_"
The issue here is that since we have both "booking" and "book", it will always double-change "BOOKING": once "BOOKING" has been wrapped, the shorter "BOOK" still matches inside it. Given my naïve start here, I don't know of an easy quick-patch other than to remove "book" (and "king", incidentally).
words3 <- setdiff(words2, c("book", "king"))
Reduce(function(txt, ptn) gsub(ptn, paste0("_", ptn ,"_"), txt, perl = TRUE),
       toupper(words3), init = vec)
# [1] "_BOOKING__PROCESS__NOTE_ADDED" "_BOOKED__DATE__TIME_" "_BOOKING__STATUS_"
From here, we can remove the leading/trailing and doubled underscores.
gsub("__", "_",
     gsub("^_|_$", "",
          Reduce(function(txt, ptn) gsub(ptn, paste0("_", ptn ,"_"), txt, perl = TRUE),
                 toupper(words3), init = vec)))
# [1] "BOOKING_PROCESS_NOTE_ADDED" "BOOKED_DATE_TIME" "BOOKING_STATUS"
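As an alternative to removing "book" and "king" by hand, the double-change can also be avoided by matching all words in a single left-to-right pass, so that text already claimed by a longer word is never matched again. The following is a minimal Python sketch of that idea rather than a translation of the R code above; the function name snake_split and the small word list are made up for illustration, and unlike the R version it lower-cases the matched words so that anything unknown stays visible in upper case.

```python
import re


def snake_split(name, dictionary):
    """Wrap each dictionary word found in `name` in one left-to-right pass."""
    # Longest words first, so "BOOKING" is tried before "BOOK" at the same position.
    ordered = sorted({w.upper() for w in dictionary}, key=len, reverse=True)
    if not ordered:
        return name
    pattern = re.compile("|".join(re.escape(w) for w in ordered))
    # Each match is consumed once, so a shorter word cannot re-match inside it.
    marked = pattern.sub(lambda m: "_" + m.group(0).lower() + "_", name)
    # Collapse repeated underscores and trim the ends.
    return re.sub(r"_+", "_", marked).strip("_")


# Hypothetical word list that deliberately keeps "book" and "king".
words = ["book", "king", "booking", "booked", "process",
         "note", "date", "time", "status"]

for col in ["BOOKINGPROCESSNOTEADDED", "BOOKEDDATETIME", "BOOKINGSTATUS"]:
    print(snake_split(col, words))
```

The first name comes out as booking_process_note_ADDED, because "added" is not in this word list. A single greedy pass can still pick the wrong word when two dictionary words overlap, so the result needs the same manual review as before.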
The quality is completely dependent on starting with a good dictionary. If all of your UPPERNOSPACEWORDS are well defined, then perhaps you can build it manually. Note that some words may just self-isolate because the text next to them is made up of known words; for example, "added" is not in words3 but it is still broken out.
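The plural, *ing and *ed forms mentioned earlier could be generated rather than typed by hand. The sketch below is in Python and uses deliberately naive English spelling rules (no consonant doubling, no irregular forms), so it will produce some non-words and miss some real ones; with_inflections is a hypothetical helper name, not part of any library.

```python
def with_inflections(words, min_len=4):
    """Return each word plus naive plural, -ing and -ed forms, longest first."""
    forms = set()
    for word in words:
        w = word.lower()
        # Naive plural: add "es" after a sibilant ending, otherwise "s".
        plural = w + "es" if w.endswith(("s", "x", "z", "ch", "sh")) else w + "s"
        # Naive stem: drop a final "e" before adding "ing" or "ed".
        stem = w[:-1] if w.endswith("e") else w
        forms.update({w, plural, stem + "ing", stem + "ed"})
    # Drop too-short entries and sort longest first (ties alphabetically).
    kept = [f for f in forms if len(f) >= min_len]
    return sorted(kept, key=lambda f: (-len(f), f))


print(with_inflections(["book", "add", "note", "process"]))
```

Sorting longest first matches the replacement order used above, and the min_len filter plays the same role as dropping the words shorter than four letters.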
I would build the dictionary manually:
Repeat the last 3 steps until all words are split. For example, with the 3 names you posted, the dictionary would first get c("booking", "status"), after which that name (BOOKINGSTATUS) would contain no uppercase letters. The name BOOKINGPROCESSNOTEADDED would become booking_PROCESSNOTEADDED; if you chose that, you'd add c("process", "note", "added") to the dictionary, and find BOOKEDDATETIME next. Now you need to decide on the words: is it c("booked", "date", "time") or c("booked", "datetime")?
And so on.
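With thousands of columns, it may help to see which unknown fragments are left after each round, most frequent first, before deciding on the next words to add. A small Python sketch, assuming the names have already been partly converted as in the booking_PROCESSNOTEADDED example; uppercase_residue is a made-up helper name.

```python
import re
from collections import Counter


def uppercase_residue(names):
    """Count the upper-case fragments still present in partially converted names."""
    counts = Counter()
    for name in names:
        # Every run of capital letters is a stretch the dictionary has not covered yet.
        counts.update(re.findall(r"[A-Z]+", name))
    return counts


# State after the first round in the example above (dictionary: booking, status).
partly_done = ["booking_PROCESSNOTEADDED", "BOOKEDDATETIME", "booking_status"]

for fragment, n in uppercase_residue(partly_done).most_common():
    print(fragment, n)
```

An empty result means every name has been fully split, which is the stopping condition for the manual loop.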