How to split a string field into two separate numeric columns in R

Question

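I have a data frame with a text field that records how long a person has stayed in a city. The format is y year(s) m month(s), where y and m are numbers. If the person has lived in the city for less than a year, the value has only the form m months.

I want to convert this column into two separate numeric columns, one showing the number of years lived there and the other the number of months.

Here is a sample of my data frame: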

df <- structure(list(Time.in.current.role = c("1 year 1 month", "11 
months", 
"3 years 11 months", "1 year 1 month", "8 months"), City = 
c("Philadelphia", 
"Seattle", "Washington D.C.", "Ashburn", "Cork, Ireland")), .Names = 
c("Time.in.current.role", 
"City"), row.names = c(NA, 5L), class = "data.frame")

The data frame I want looks like this:

result <- structure(list(Year = c(1, 0, 3, 1, 0), Month = c(1, 11, 
11, 
1, 8), City = structure(c(3L, 4L, 5L, 1L, 2L), .Label = c("Ashburn", 
"Cork, Ireland", "Philadelphia", "Seattle", "Washington D.C."
), class = "factor")), .Names = c("Year", "Month", "City"), row.names 
= c(NA, 
-5L), class = "data.frame")

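I was thinking of using grep to find which rows contain the substring "year" and which contain the substring "month". But after that, I can't work out how to get the number associated with "year" or "month".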

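* EDIT * In my original post I forgot to mention that a value may contain only y year(s). Here are the new original data frame and the desired data frame: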

df <- structure(list(Time.in.current.role = c("1 year 1 month", "11 
months", 
"3 years 11 months", "1 year 1 month", "8 months", "2 years"), 
City = c("Philadelphia", "Seattle", "Washington D.C.", "Ashburn", 
"Cork, Ireland", "Washington D.C.")), .Names = 
c("Time.in.current.role", 
"City"), row.names = c(1L, 2L, 3L, 4L, 5L, 18L), class = 
"data.frame")

result <- structure(list(Year = c(1, 0, 3, 1, 0, 2), Month = c(1, 11, 
11, 
1, 8, 0), City = structure(c(3L, 4L, 5L, 1L, 2L, 5L), .Label = 
c("Ashburn", 
"Cork, Ireland", "Philadelphia", "Seattle", "Washington D.C."
), class = "factor")), .Names = c("Year", "Month", "City"), row.names 
= c(NA, 
-6L), class = "data.frame")

Answer 1

You can do the following:

z = regmatches(x = df$Time.in.current.role, gregexpr("\\d+", df$Time.in.current.role))
years = sapply(z, function(x){ifelse(length(x)==1, 0, x[1])})
months = sapply(z, function(x){ifelse(length(x)==1, x[1], x[2])})

This gives:

> years
[1] "1" "0" "3" "1" "0"
> months
[1] "1"  "11" "11" "1"  "8"

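This approach works when the string contains one or two numbers. If there is only one, it is assumed to be the number of months. A case where this fails is "5 years".

To handle that case, you can do the following: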

m = regmatches(x = df$Time.in.current.role, gregexpr("\\d+ m", df$Time.in.current.role))
y = regmatches(x = df$Time.in.current.role, gregexpr("\\d+ y", df$Time.in.current.role))
y2 = sapply(y, function(x){ifelse(length(x)==0,0,gsub("\\D+","",x))})
m2 = sapply(m, function(x){ifelse(length(x)==0,0,gsub("\\D+","",x))})

Example:

> df
  Time.in.current.role            City
1       1 year 1 month    Philadelphia
2            11 months         Seattle
3    3 years 11 months Washington D.C.
4       1 year 1 month         Ashburn
5             8 months   Cork, Ireland
6              5 years           Miami

> y2
[1] "1" "0" "3" "1" "0" "5"
> m2
[1] "1"  "11" "11" "1"  "8"  "0"
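
The vectors y2 and m2 above are character, while the question asks for numeric columns. A minimal sketch of that last step, assuming the six-row data from the question's edit with a single space between each number and its unit. The helper name count_unit is hypothetical and not part of the original answer:

```r
# Sample data from the question's edit (row names simplified,
# single space between each number and its unit is assumed).
df <- data.frame(
  Time.in.current.role = c("1 year 1 month", "11 months", "3 years 11 months",
                           "1 year 1 month", "8 months", "2 years"),
  City = c("Philadelphia", "Seattle", "Washington D.C.", "Ashburn",
           "Cork, Ireland", "Washington D.C."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the number that precedes a unit initial ("y" or "m"),
# or 0 when that unit does not appear in the string.
count_unit <- function(x, unit) {
  hits <- regmatches(x, gregexpr(paste0("\\d+ ", unit), x))
  vapply(hits, function(h) {
    if (length(h) == 0) 0 else as.numeric(gsub("\\D+", "", h[1]))
  }, numeric(1))
}

result <- data.frame(
  Year  = count_unit(df$Time.in.current.role, "y"),
  Month = count_unit(df$Time.in.current.role, "m"),
  City  = df$City
)
```

Using if/else inside vapply rather than ifelse keeps the scalar test explicit and guarantees a numeric result for every row.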

Answer 2

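Another approach is to use the splitstackshape package to split the column in two. To do that, first use gsub to put a delimiter between the years and the months, then remove all letters and whitespace, and then use cSplit: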

# replace delimiter year with ;
df$Time.in.current.role <- gsub("year", ";", df$Time.in.current.role)

# If no year was found add 0; at the beginning of the cell
df$Time.in.current.role[!grepl(";", df$Time.in.current.role)] <- paste0("0;", df$Time.in.current.role[!grepl(";", df$Time.in.current.role)])

# remove characters and whitespace
df$Time.in.current.role <- gsub("[[:alpha:]]|\\s+", "", df$Time.in.current.role)

# Split column by ;
df <- splitstackshape::cSplit(df, "Time.in.current.role", sep = ";")

# Rename new columns
colnames(df)[2:3] <- c("Year", "Month")

df
              City  Year  Month
1:    Philadelphia     1      1
2:         Seattle     0     11
3: Washington D.C.     3     11
4:         Ashburn     1      1
5:   Cork, Ireland     0      8
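
The same delimiter idea can be sketched in base R with strsplit, for cases where adding a package is not wanted. This is an illustrative variant, not the original answer's code; it assumes every value follows the y year(s) m month(s) pattern and adds one extra step so that years-only values get an explicit month of 0:

```r
x <- c("1 year 1 month", "11 months", "3 years 11 months", "2 years")

# Mark the year/month boundary with ";" (the optional "s" is consumed too).
s <- gsub("years?", ";", x)

# Values with no year part get an explicit "0;" prefix.
no_year <- !grepl(";", s)
s[no_year] <- paste0("0;", s[no_year])

# Keep only digits and the delimiter.
s <- gsub("[^0-9;]", "", s)

# Values with no month part now end in ";", so append an explicit 0.
s <- sub(";$", ";0", s)

# Every string is now "<years>;<months>", so each split has two pieces.
parts <- do.call(rbind, strsplit(s, ";", fixed = TRUE))
out <- data.frame(Year  = as.numeric(parts[, 1]),
                  Month = as.numeric(parts[, 2]))
```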

Answer 3

A quick and dirty solution:

Code:

# Keep only digits and spaces ("|" and "^" inside a bracket class would be literal characters)
ym <- gsub("[^0-9 ]", "", df$Time.in.current.role)
ym <- gsub("^ | $", "", ym)
df$Year <- ifelse(
  grepl(" ", ym), 
  gsub("([0-9]+) .+", "\\1", ym), 
  0
)
df$Month <- gsub(".+ ([0-9]+)$", "\\1", ym)
df$Time.in.current.role <- NULL
df

             City Year Month
1    Philadelphia    1     1
2         Seattle    0    11
3 Washington D.C.    3    11
4         Ashburn    1     1
5   Cork, Ireland    0     8

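In words:

First remove everything that is not a digit or a space
Remove any space at the start or end of the string
If the string contains two numbers, extract the first as the year, otherwise year = 0.
The last number is always the month.
Remove the original column from the data.frame
Enjoy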
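
The steps above can be traced on a few sample strings. This is a sketch for illustration only, with the character class written as [^0-9 ] so that it keeps just digits and spaces. The third input is an assumption added here to show the limit of the "last number is always the month" rule:

```r
x <- c("3 years 11 months", "8 months", "2 years")

ym <- gsub("[^0-9 ]", "", x)   # step 1: keep digits and spaces only
ym <- gsub("^ | $", "", ym)    # step 2: trim a leading or trailing space

# step 3: two numbers present -> the first one is the year, otherwise 0
year <- ifelse(grepl(" ", ym), gsub("([0-9]+) .+", "\\1", ym), 0)

# step 4: the last number is taken as the month
month <- gsub(".+ ([0-9]+)$", "\\1", ym)

as.numeric(year)   # 3 0 0
as.numeric(month)  # 11 8 2
```

The last element shows that a years-only value such as "2 years" is read as 2 months, so this approach relies on a month part always being present.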

Answer 4

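This defines a function extr (see also the alternative definition at the end) that extracts from its first argument the matches to the capture group of its second argument, i.e. the part of the regular expression within parentheses. The match is then converted to numeric, or 0 is returned if the pattern is not found.

It is only 3 lines of code, has a pleasing symmetry in how it handles years and months, and handles not only values with both years and months but also values with months only or years only. It allows junk to appear before y and m, such as the \n shown in the question's sample data.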

library(gsubfn)

extr <- function(x, pat) strapply(x, pat, as.numeric, empty = 0, simplify = TRUE)
transform(df, Year = extr(Time.in.current.role, "(\\d+) +\\W*y"),
              Month = extr(Time.in.current.role, "(\\d+) +\\W*m"))

This gives (for the data frame defined in the question):

  Time.in.current.role            City Year Month
1       1 year 1 month    Philadelphia    1     1
2          11 \nmonths         Seattle    0    11
3    3 years 11 months Washington D.C.    3    11
4       1 year 1 month         Ashburn    1     1
5             8 months   Cork, Ireland    0     8

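Note that strapply uses the tcl regex engine. If tcltk does not work on your system, use this slightly longer version of extr, or better yet fix your installation, since tcltk is a base package and if it does not work your R installation is broken.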

extr <- function(x, pat) {
  sapply(strapply(x, pat, as.numeric), function(x) if (is.null(x)) 0 else x)
}
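
For readers who prefer to avoid the gsubfn dependency, the same capture-group idea can be sketched with base R's regexec and regmatches. The name extr_base is hypothetical and this is only an approximation of extr: it returns the first match per string, which is enough for the formats in the question:

```r
# Hypothetical base-R stand-in for extr(): the first capture group of the
# first match as a number, or 0 when the pattern is not found.
extr_base <- function(x, pat) {
  m <- regmatches(x, regexec(pat, x))
  vapply(m, function(g) if (length(g) < 2) 0 else as.numeric(g[2]), numeric(1))
}

x <- c("1 year 1 month", "11 \nmonths", "3 years 11 months", "2 years")

years  <- extr_base(x, "(\\d+) +\\W*y")   # 1 0 3 2
months <- extr_base(x, "(\\d+) +\\W*m")   # 1 11 11 0
```

The second input keeps the embedded newline from the question's sample data; the \\W* part of the pattern absorbs it.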

Problem description

4 solutions

Solution 1
1 2018-03-16 15:03:14

Solution 2
1 2018-03-16 15:07:29

Solution 3
1 2018-03-16 15:07:59

Solution 4
1 accepted 2018-03-16 15:24:15
