Splitting string into unknown number of new dataframe columns
I have a data frame with a character column that contains email metadata, in the form of multiple strings separated by the newline character \n:
person myString
1 John To name5@email.com by sender6 on 01-12-2014\n
2 Jane To name@email.com,name4@email.com by sender1 on 01-22-2014\nTo name3@email.com by sender2 on 02-03-2014\nTo email5@domain.com by sender1 on 06-21-2014\n
3 Tim To name2@email.com by sender2 on 05-11-2014\nTo name@email.com by sender2 on 06-03-2015\n
I want to split the different substrings of myString into separate columns, so that it looks like this:
person email1 email2 email3
1 John To name5@email.com by sender6 on 01-12-2014 <NA> <NA>
2 Jane To name@email.com,name4@email.com by sender1 on 01-22-2014 To name3@email.com by sender2 on 02-03-2014 To email5@domain.com by sender1 on 06-21-2014
3 Tim To name2@email.com by sender2 on 05-11-2014 To name@email.com by sender2 on 06-03-2015 <NA>
My current approach uses separate from the tidyr package:
library(dplyr)
library(tidyr)
res1 <- df %>%
  separate(col = myString, into = paste0("email", 1:3), sep = "\\n", extra = "drop")  # paste0 gives "email1", not "email 1"
res1[res1 == ""] <- NA
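With this approach, however, I have to specify by hand that three columns should be extracted.
I am hoping to improve this process in one or both of the following ways:
Also, if there is a good solution that returns the data in long format (rather than wide), that would be welcome too.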
Sample data:
df <- structure(list(person = c("John", "Jane", "Tim"), myString = c("To name5@email.com by sender6 on 01-12-2014\n",
"To name@email.com,name4@email.com by sender1 on 01-22-2014\nTo name3@email.com by sender2 on 02-03-2014\nTo email5@domain.com by sender1 on 06-21-2014\n",
"To name2@email.com by sender2 on 05-11-2014\nTo name@email.com by sender2 on 06-03-2015\n"
)), .Names = c("person", "myString"), row.names = c(NA, -3L), class = "data.frame")
I would suggest cSplit from my "splitstackshape" package:
library(splitstackshape)
cSplit(df, "myString", "\n")
# person myString_1
# 1: John To name5@email.com by sender6 on 01-12-2014
# 2: Jane To name@email.com,name4@email.com by sender1 on 01-22-2014
# 3: Tim To name2@email.com by sender2 on 05-11-2014
# myString_2
# 1: NA
# 2: To name3@email.com by sender2 on 02-03-2014
# 3: To name@email.com by sender2 on 06-03-2015
# myString_3
# 1: NA
# 2: To email5@domain.com by sender1 on 06-21-2014
# 3: NA
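You could also try stri_split_fixed from the "stringi" package with the argument simplify = TRUE (although, for the sample data, this adds an extra empty column at the end). The approach would be something like: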
library(stringi)
data.frame(person = df$person,
stri_split_fixed(df$myString, "\n",
simplify = TRUE))
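The extra empty column comes from the trailing "\n" in each string. A minimal sketch of one way to avoid it, assuming a stringi version that supports the omit_empty argument and simplify = NA; the emailN column names are my own choice to match the desired output in the question:

```r
library(stringi)

# Sample data from the question
df <- data.frame(
  person = c("John", "Jane", "Tim"),
  myString = c(
    "To name5@email.com by sender6 on 01-12-2014\n",
    "To name@email.com,name4@email.com by sender1 on 01-22-2014\nTo name3@email.com by sender2 on 02-03-2014\nTo email5@domain.com by sender1 on 06-21-2014\n",
    "To name2@email.com by sender2 on 05-11-2014\nTo name@email.com by sender2 on 06-03-2015\n"
  ),
  stringsAsFactors = FALSE
)

# omit_empty = TRUE drops the empty piece after the trailing "\n";
# simplify = NA returns a matrix padded with NA instead of "".
parts <- stri_split_fixed(df$myString, "\n", omit_empty = TRUE, simplify = NA)

# Name the columns email1, email2, ... (naming is an assumption)
colnames(parts) <- paste0("email", seq_len(ncol(parts)))

res <- data.frame(person = df$person, parts, stringsAsFactors = FALSE)
```

The number of columns is taken from the data, so nothing has to be specified by hand.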
It seems hacky, but you could...
Use strsplit to split the character vector. Get the maximum length and use it for the number of columns.
df <- data.frame(
person = c("John", "Jane", "Tim"),
myString = c("To name5@email.com by sender6 on 01-12-2014\n",
"To name@email.com,name4@email.com by sender1 on 01-22-2014\nTo name3@email.com by sender2 on 02-03-2014\nTo email5@domain.com by sender1 on 06-21-2014\n",
"To name2@email.com by sender2 on 05-11-2014\nTo name@email.com by sender2 on 06-03-2015\n"
), stringsAsFactors=FALSE
)
a <- strsplit(df$myString, "\n")
max_len <- max(sapply(a, length))
for(i in 1:max_len){
df[,paste0("email", i)] <- sapply(a, "[", i)
}
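The same idea can be wrapped in a small helper that does not modify df in place. This is only a sketch: split_to_columns and its arguments are hypothetical names of my own, and it assumes R 3.2 or later for lengths().

```r
# Hypothetical helper: split one character column into as many new
# columns as the longest row needs, padding shorter rows with NA.
split_to_columns <- function(data, col, sep = "\n", prefix = "email") {
  pieces <- strsplit(data[[col]], sep, fixed = TRUE)
  n_max <- max(lengths(pieces))
  # `length<-` extends each vector to n_max, filling with NA
  padded <- lapply(pieces, `length<-`, n_max)
  mat <- do.call(rbind, padded)
  colnames(mat) <- paste0(prefix, seq_len(n_max))
  # Keep every column except the one that was split
  data.frame(data[setdiff(names(data), col)], mat, stringsAsFactors = FALSE)
}

# Sample data from the question
df <- data.frame(
  person = c("John", "Jane", "Tim"),
  myString = c(
    "To name5@email.com by sender6 on 01-12-2014\n",
    "To name@email.com,name4@email.com by sender1 on 01-22-2014\nTo name3@email.com by sender2 on 02-03-2014\nTo email5@domain.com by sender1 on 06-21-2014\n",
    "To name2@email.com by sender2 on 05-11-2014\nTo name@email.com by sender2 on 06-03-2015\n"
  ),
  stringsAsFactors = FALSE
)

wide <- split_to_columns(df, "myString")
```

strsplit drops the empty piece after a trailing separator, so no extra empty column appears here.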
Here is an efficient route to long format:
a <- strsplit(df$myString, "\n")
lens <- vapply(a, length, integer(1L)) # or lengths(a) in R >= 3.2
longdf <- df[rep(seq_along(a), lens),]
longdf$string <- unlist(a)
Note that stack() is often useful in these situations.
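As a rough illustration of the stack() route, assuming the values in person are unique so that they can serve as list names (the renamed columns are my own choice):

```r
# Sample data from the question
df <- data.frame(
  person = c("John", "Jane", "Tim"),
  myString = c(
    "To name5@email.com by sender6 on 01-12-2014\n",
    "To name@email.com,name4@email.com by sender1 on 01-22-2014\nTo name3@email.com by sender2 on 02-03-2014\nTo email5@domain.com by sender1 on 06-21-2014\n",
    "To name2@email.com by sender2 on 05-11-2014\nTo name@email.com by sender2 on 06-03-2015\n"
  ),
  stringsAsFactors = FALSE
)

a <- strsplit(df$myString, "\n", fixed = TRUE)

# stack() turns a named list into a two-column data frame:
# `values` holds the pieces, `ind` the name each piece came from.
stacked <- stack(setNames(a, df$person))
names(stacked) <- c("string", "person")
```

Here person comes back as a factor; wrap it in as.character() if a character column is needed.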
This can be simplified by using the IRanges Bioconductor package:
longdf <- df[togroup(a),]
longdf$string <- unlist(a)
Then, if it is really necessary, move to a wide table:
longdf$myString <- NULL
longdf$token <- sequence(lens)
widedf <- reshape(longdf, timevar="token", idvar="person", direction="wide")
This might be sufficient:
library(data.table)
dt = as.data.table(df) # or setDT to convert in place
dt[, strsplit(myString, split = "\n"), by = person]
# person V1
#1: John To name5@email.com by sender6 on 01-12-2014
#2: Jane To name@email.com,name4@email.com by sender1 on 01-22-2014
#3: Jane To name3@email.com by sender2 on 02-03-2014
#4: Jane To email5@domain.com by sender1 on 06-21-2014
#5: Tim To name2@email.com by sender2 on 05-11-2014
#6: Tim To name@email.com by sender2 on 06-03-2015
It can then easily be converted to wide format:
dcast(dt[, strsplit(myString, split = "\n"), by = person][, idx := 1:.N, by = person],
person ~ idx, value.var = 'V1')
# person 1 2 3
#1: Jane To name@email.com,name4@email.com by sender1 on 01-22-2014 To name3@email.com by sender2 on 02-03-2014 To email5@domain.com by sender1 on 06-21-2014
#2: John To name5@email.com by sender6 on 01-12-2014 NA NA
#3: Tim To name2@email.com by sender2 on 05-11-2014 To name@email.com by sender2 on 06-03-2015 NA
# (load reshape2 and use dcast.data.table instead of dcast if not using 1.9.5+)