Create new variables based upon specific values
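I read up on regular expressions and Hadley Wickham's stringr and dplyr packages, but can't figure out how to get this to work.
I have library circulation data in a data frame, with the call number stored as a character variable. I'd like to take the initial capital letters and make them a new variable, and turn the digits between those letters and the period into a second new variable.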
Call_Num
HV5822.H4 C47 Circulating Collection, 3rd Floor
QE511.4 .G53 1982 Circulating Collection, 3rd Floor
TL515 .M63 Circulating Collection, 3rd Floor
D753 .F4 Circulating Collection, 3rd Floor
DB89.F7 D4 Circulating Collection, 3rd Floor
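Using the stringi package would be one option. Since your targets sit at the beginning of each string, stri_extract_first() works well. [:alpha:]{1,} matches a sequence of one or more letters, so stri_extract_first() with that pattern returns the first run of letters. Likewise, stri_extract_first(x, regex = "\\d{1,}") returns the first run of digits.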
x <- c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
       "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
       "TL515 .M63 Circulating Collection, 3rd Floor",
       "D753 .F4 Circulating Collection, 3rd Floor",
       "DB89.F7 D4 Circulating Collection, 3rd Floor")
library(stringi)
data.frame(alpha = stri_extract_first(x, regex = "[:alpha:]{1,}"),
           number = stri_extract_first(x, regex = "\\d{1,}"))
# alpha number
#1 HV 5822
#2 QE 511
#3 TL 515
#4 D 753
#5 DB 89
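Note that extracting "the first run of letters" and "the first run of digits" separately does not require them to sit next to each other at the start of the string. As a minimal sketch of a stricter variant, the following uses base R (regexec() and regmatches(), not stringi) with a pattern anchored by ^. The names m, parts and result are illustrative only, and the NA handling for non-matching rows is an assumption about what you would want.

```r
x <- c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
       "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
       "TL515 .M63 Circulating Collection, 3rd Floor",
       "D753 .F4 Circulating Collection, 3rd Floor",
       "DB89.F7 D4 Circulating Collection, 3rd Floor")

# regexec() reports the full match plus one entry per capture group;
# "^" ties the match to the start of each string.
m <- regmatches(x, regexec("^([A-Z]+)([0-9]+)", x))

# A string that does not match yields character(0); map it to a row of NAs.
parts <- t(vapply(m,
                  function(p) if (length(p) == 3) p else rep(NA_character_, 3),
                  character(3)))

result <- data.frame(alpha  = parts[, 2],
                     number = as.integer(parts[, 3]),
                     stringsAsFactors = FALSE)
result
```

Converting the second group with as.integer() gives a numeric column, which may or may not be what you need for call numbers.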
How about:
rl <- read.table(header = TRUE, text = "Call_Num
'HV5822.H4 C47 Circulating Collection, 3rd Floor'
'QE511.4 .G53 1982 Circulating Collection, 3rd Floor'
'TL515 .M63 Circulating Collection, 3rd Floor'
'D753 .F4 Circulating Collection, 3rd Floor'
'DB89.F7 D4 Circulating Collection, 3rd Floor'",
stringsAsFactors = FALSE)
cbind(rl, read.table(text = gsub('([A-Z]+)([0-9]+).*', '\\1 \\2', rl$Call_Num)))
# Call_Num V1 V2
# 1 HV5822.H4 C47 Circulating Collection, 3rd Floor HV 5822
# 2 QE511.4 .G53 1982 Circulating Collection, 3rd Floor QE 511
# 3 TL515 .M63 Circulating Collection, 3rd Floor TL 515
# 4 D753 .F4 Circulating Collection, 3rd Floor D 753
# 5 DB89.F7 D4 Circulating Collection, 3rd Floor DB 89
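One thing to watch with the read.table() approach is that it guesses column types and assigns default names (V1, V2). A sketch of the same idea with explicit names and types follows. The column names letter and number are my own choice, and the ^ anchor is an addition that the snippet above does not have. Fixing the types with colClasses also guards against a column made up only of values such as "T" or "F" being read as logical.

```r
x <- c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
       "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
       "TL515 .M63 Circulating Collection, 3rd Floor",
       "D753 .F4 Circulating Collection, 3rd Floor",
       "DB89.F7 D4 Circulating Collection, 3rd Floor")

# Reduce each string to "LETTERS DIGITS", then let read.table() split on the space.
reduced <- gsub("^([A-Z]+)([0-9]+).*", "\\1 \\2", x)

parts <- read.table(text = reduced,
                    col.names  = c("letter", "number"),
                    colClasses = c("character", "integer"))
parts
```

This assumes every call number matches the pattern; a string that does not match is passed through unchanged by gsub() and would likely make read.table() fail or misalign.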
If you want to use stringr, the solution might look like this:
df <- data.frame(Call_Num = c("HV5822.H4 C47 Circulating Collection, 3rd Floor", "QE511.4 .G53 1982 Circulating Collection, 3rd Floor", "TL515 .M63 Circulating Collection, 3rd Floor", "D753 .F4 Circulating Collection, 3rd Floor", "DB89.F7 D4 Circulating Collection, 3rd Floor"))
library(stringr)
# Column 1 of the result is the full match; columns 2 and 3 are the capture groups.
matches <- str_match(df$Call_Num, "([A-Z]+)(\\d+)\\s*\\.")
df2 <- data.frame(df, letter=matches[,2], number=matches[,3])
df2
## Call_Num letter number
## 1 HV5822.H4 C47 Circulating Collection, 3rd Floor HV 5822
## 2 QE511.4 .G53 1982 Circulating Collection, 3rd Floor QE 511
## 3 TL515 .M63 Circulating Collection, 3rd Floor TL 515
## 4 D753 .F4 Circulating Collection, 3rd Floor D 753
## 5 DB89.F7 D4 Circulating Collection, 3rd Floor DB 89
I don't think it would be worthwhile to stick the str_match() call into mutate() from dplyr, so I'd just leave it at that. Or use rawr's solution.
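For anyone who does want the dplyr form the question asked about, here is a rough sketch of what wrapping str_match() in mutate() could look like. It assumes dplyr and stringr are installed. The pattern is a simplified, start-anchored variant rather than the exact one used above, and the integer conversion is an optional extra.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  Call_Num = c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
               "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
               "TL515 .M63 Circulating Collection, 3rd Floor",
               "D753 .F4 Circulating Collection, 3rd Floor",
               "DB89.F7 D4 Circulating Collection, 3rd Floor"),
  stringsAsFactors = FALSE
)

# Leading capital letters, then the digits that immediately follow them.
pattern <- "^([A-Z]+)(\\d+)"

df2 <- df %>%
  mutate(
    letter = str_match(Call_Num, pattern)[, 2],              # first capture group
    number = as.integer(str_match(Call_Num, pattern)[, 3])   # second capture group
  )
df2
```

The pattern is matched twice here, which is part of why the plain data.frame() version above is arguably just as readable.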
You can use strapply from the gsubfn package:
library(gsubfn)
m <- strapply(as.character(df$Call_Num), '^([A-Z]+)(\\d+)',
~ c(id = x, num = y), simplify = rbind)
X <- as.data.frame(m, stringsAsFactors = FALSE)
X
# id num
# 1 HV 5822
# 2 QE 511
# 3 TL 515
# 4 D 753
# 5 DB 89