How to delimit a string field into two different numeric columns in R
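I have a data frame with a text field that records how long a person has stayed in a city. The format is y year(s) m month(s), where y and m are numbers. If the person has lived in the city for less than one year, the value is formatted as just m months.
I would like to convert this column into two separate numeric columns, one showing the years lived and the other showing the months lived.
Here is a sample of my data frame: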
df <- structure(list(
  Time.in.current.role = c("1 year 1 month", "11 \nmonths", "3 years 11 months",
                           "1 year 1 month", "8 months"),
  City = c("Philadelphia", "Seattle", "Washington D.C.", "Ashburn", "Cork, Ireland")),
  .Names = c("Time.in.current.role", "City"),
  row.names = c(NA, 5L), class = "data.frame")
The data frame I want looks like this:
result <- structure(list(
  Year = c(1, 0, 3, 1, 0),
  Month = c(1, 11, 11, 1, 8),
  City = structure(c(3L, 4L, 5L, 1L, 2L),
                   .Label = c("Ashburn", "Cork, Ireland", "Philadelphia",
                              "Seattle", "Washington D.C."),
                   class = "factor")),
  .Names = c("Year", "Month", "City"),
  row.names = c(NA, -5L), class = "data.frame")
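I was thinking of using grep to locate which rows contain the substring "year" and which contain the substring "month". But after that, I cannot work out how to get the number that belongs to "year" or "month".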
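As a minimal sketch of the grep idea described above (not a full solution), grepl can flag which rows mention each unit. The vector x is hypothetical sample data in the question's format.

```r
# Hypothetical sample values in the question's format
x <- c("1 year 1 month", "11 months", "3 years 11 months")

# TRUE where the unit name appears anywhere in the string
has_year  <- grepl("year", x, fixed = TRUE)
has_month <- grepl("month", x, fixed = TRUE)

has_year
has_month
```

This only tells you where a unit occurs; getting the number attached to it still needs a pattern that captures the digits next to the unit.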
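Edit: In my original post, I forgot to mention that a value may also contain only y year(s). Here are the new original data frame and the desired data frame: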
df <- structure(list(
  Time.in.current.role = c("1 year 1 month", "11 \nmonths", "3 years 11 months",
                           "1 year 1 month", "8 months", "2 years"),
  City = c("Philadelphia", "Seattle", "Washington D.C.", "Ashburn",
           "Cork, Ireland", "Washington D.C.")),
  .Names = c("Time.in.current.role", "City"),
  row.names = c(1L, 2L, 3L, 4L, 5L, 18L), class = "data.frame")
result <- structure(list(
  Year = c(1, 0, 3, 1, 0, 2),
  Month = c(1, 11, 11, 1, 8, 0),
  City = structure(c(3L, 4L, 5L, 1L, 2L, 5L),
                   .Label = c("Ashburn", "Cork, Ireland", "Philadelphia",
                              "Seattle", "Washington D.C."),
                   class = "factor")),
  .Names = c("Year", "Month", "City"),
  row.names = c(NA, -6L), class = "data.frame")
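You can do the following:
# pull every run of digits out of each string
z = regmatches(x = df$Time.in.current.role, gregexpr("\\d+", df$Time.in.current.role))
# one number: treat it as months (years = 0); two numbers: years first, then months
years = sapply(z, function(x){ifelse(length(x)==1, 0, x[1])})
months = sapply(z, function(x){ifelse(length(x)==1, x[1], x[2])})
This gives: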
> years
[1] "1" "0" "3" "1" "0"
> months
[1] "1" "11" "11" "1" "8"
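This works if there are one or two numbers. If there is only one, it is assumed to be the number of months. A case where this does not work is, for example, "5 years".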
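To make that limitation concrete, here is a small sketch that applies the same "a lone number means months" rule to a hypothetical vector x containing a years-only value. It uses plain if/else instead of ifelse, which is an equivalent rewrite for this case rather than the answer's exact code.

```r
# Hypothetical input that includes a years-only value
x <- c("1 year 1 month", "8 months", "5 years")
z <- regmatches(x, gregexpr("\\d+", x))

# Same rule as above: a lone number is treated as months
years  <- sapply(z, function(v) if (length(v) == 1) "0" else v[1])
months <- sapply(z, function(v) if (length(v) == 1) v[1] else v[2])

# The last row gets years "0" and months "5", which is wrong for "5 years"
data.frame(x, years, months)
```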
In that case, you could do the following:
# match each number together with the first letter of its unit
m = regmatches(x = df$Time.in.current.role, gregexpr("\\d+ m", df$Time.in.current.role))
y = regmatches(x = df$Time.in.current.role, gregexpr("\\d+ y", df$Time.in.current.role))
# keep only the digits; use 0 when the unit is absent
y2 = sapply(y, function(x){ifelse(length(x)==0,0,gsub("\\D+","",x))})
m2 = sapply(m, function(x){ifelse(length(x)==0,0,gsub("\\D+","",x))})
Example:
> df
  Time.in.current.role            City
1       1 year 1 month    Philadelphia
2            11 months         Seattle
3    3 years 11 months Washington D.C.
4       1 year 1 month         Ashburn
5             8 months   Cork, Ireland
6              5 years           Miami
> y2
[1] "1" "0" "3" "1" "0" "5"
> m2
[1] "1"  "11" "11" "1"  "8"  "0"
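Note that y2 and m2 above are character vectors, while the question asks for numeric columns. The following is a minimal sketch of one way to finish the job. The helper name num_before and the three-row sample data are assumptions made for illustration; they are not part of the answer.

```r
# Hypothetical sample data in the question's format
df <- data.frame(
  Time.in.current.role = c("1 year 1 month", "11 months", "5 years"),
  City = c("Philadelphia", "Seattle", "Miami"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the number followed by a space and the unit's
# first letter, returned as numeric; 0 when the unit is absent
num_before <- function(x, unit) {
  hits <- regmatches(x, gregexpr(paste0("\\d+ ", unit), x))
  as.numeric(sapply(hits, function(h) {
    if (length(h) == 0) "0" else gsub("\\D+", "", h[1])
  }))
}

out <- data.frame(
  Year  = num_before(df$Time.in.current.role, "y"),
  Month = num_before(df$Time.in.current.role, "m"),
  City  = df$City
)
out
```

Like the patterns above, this expects exactly one space between the number and the unit.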
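Another approach is to use the splitstackshape package to split the column in two. To do this, first use gsub to put a delimiter between the years and the months, then remove all letters, and then use cSplit: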
# replace delimiter year with ;
df$Time.in.current.role <- gsub("year", ";", df$Time.in.current.role)
# If no year was found add 0; at the beginning of the cell
df$Time.in.current.role[!grepl(";", df$Time.in.current.role)] <- paste0("0;", df$Time.in.current.role[!grepl(";", df$Time.in.current.role)])
# remove characters and whitespace
df$Time.in.current.role <- gsub("[[:alpha:]]|\\s+", "", df$Time.in.current.role)
# Split column by ;
df <- splitstackshape::cSplit(df, "Time.in.current.role", sep = ";")
# Rename new columns
colnames(df)[2:3] <- c("Year", "Month")
df
              City Year Month
1:    Philadelphia    1     1
2:         Seattle    0    11
3: Washington D.C.    3    11
4:         Ashburn    1     1
5:   Cork, Ireland    0     8
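To see what the delimiter steps produce before the split, the same gsub calls can be traced on a plain character vector. This sketch stays in base R and swaps cSplit for strsplit purely for illustration; x is hypothetical sample data.

```r
# Hypothetical sample values in the question's format
x <- c("1 year 1 month", "11 months", "3 years 11 months")

s <- gsub("year", ";", x)               # mark the year/month boundary
no_year <- !grepl(";", s)
s[no_year] <- paste0("0;", s[no_year])  # rows without a year get a leading 0
s <- gsub("[[:alpha:]]|\\s+", "", s)    # drop letters and whitespace
s                                       # "1;1" "0;11" "3;11"

# Base-R stand-in for cSplit: split on ";" and convert to numeric
parts <- do.call(rbind, strsplit(s, ";", fixed = TRUE))
storage.mode(parts) <- "numeric"
colnames(parts) <- c("Year", "Month")
parts
```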
A quick 'n' dirty solution:
Code:
ym <- gsub("[^0-9|^ ]", "", df$Time.in.current.role)
ym <- gsub("^ | $", "", ym)
df$Year <- ifelse(
  grepl(" ", ym),
  gsub("([0-9]+) .+", "\\1", ym),
  0
)
df$Month <- gsub(".+ ([0-9]+)$", "\\1", ym)
df$Time.in.current.role <- NULL
df
             City Year Month
1    Philadelphia    1     1
2         Seattle    0    11
3 Washington D.C.    3    11
4         Ashburn    1     1
5   Cork, Ireland    0     8
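In words:
year = 0.
This defines a function extr (also see the alternate definition at the end) which extracts, from its first argument, the matches to the capture group of its second argument, i.e. the match to the portion of the regular expression within parentheses. The matches are then converted to numeric, or 0 is returned if the pattern is not found.
It is only 3 lines of code, has a pleasing symmetry in how it handles years and months, and can handle not only years and months together but also months only or years only. It allows junk before the y and m, such as the \n shown in the example data in the question.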
library(gsubfn)
extr <- function(x, pat) strapply(x, pat, as.numeric, empty = 0, simplify = TRUE)
transform(df, Year = extr(Time.in.current.role, "(\\d+) +\\W*y"),
Month = extr(Time.in.current.role, "(\\d+) +\\W*m"))
giving (for the data frame defined in the question):
  Time.in.current.role            City Year Month
1       1 year 1 month    Philadelphia    1     1
2          11 \nmonths         Seattle    0    11
3    3 years 11 months Washington D.C.    3    11
4       1 year 1 month         Ashburn    1     1
5             8 months   Cork, Ireland    0     8
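Note that strapply uses the tcl regex engine. If tcltk does not work on your system, use this slightly longer version of extr, or better yet, fix your installation: tcltk is a base package, and if it is not working, your R installation is broken.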
extr <- function(x, pat) {
  sapply(strapply(x, pat, as.numeric), function(x) if (is.null(x)) 0 else x)
}
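If gsubfn is not available at all, the same idea can be sketched in base R with regexec and regmatches. The function name extr_base and the sample vector x are assumptions made for illustration. Unlike strapply, this version only looks at the first match in each string, which is enough for values in this format.

```r
# Hypothetical base-R stand-in for extr: first capture group as numeric,
# or 0 when the pattern does not match
extr_base <- function(x, pat) {
  m <- regmatches(x, regexec(pat, x))
  sapply(m, function(g) if (length(g) < 2) 0 else as.numeric(g[2]))
}

# Hypothetical sample values, including the "\n" junk and a years-only value
x <- c("1 year 1 month", "11 \nmonths", "2 years")
years  <- extr_base(x, "(\\d+) +\\W*y")
months <- extr_base(x, "(\\d+) +\\W*m")
data.frame(Year = years, Month = months)
```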