Split delimited strings into distinct columns in R dataframe
I need a fast and concise way to split delimited strings in a data frame into a set of columns. Let's say I have this data frame:
data <- data.frame(id=c(1,2,3), tok1=c("a, b, c", "a, a, d", "b, d, e"), tok2=c("alpha|bravo", "alpha|charlie", "tango|tango|delta") )
(Please note that the delimiters differ between columns.)
The number of string columns is usually not known in advance (although I can try to enumerate the whole set of cases if I have no alternative).
I need two data frames like these:
tok1.occurrences:
+----+---+---+---+---+---+
| id | a | b | c | d | e |
+----+---+---+---+---+---+
| 1 | 1 | 1 | 1 | 0 | 0 |
| 2 | 2 | 0 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 | 1 | 1 |
+----+---+---+---+---+---+
tok2.occurrences:
+----+-------+-------+---------+-------+-------+
| id | alpha | bravo | charlie | delta | tango |
+----+-------+-------+---------+-------+-------+
| 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 2 |
+----+-------+-------+---------+-------+-------+
I tried using this syntax:
tok1.f = factor(data$tok1)
dummies <- model.matrix(~tok1.f)
This ended up as an incomplete solution. It creates my dummy variables correctly, but (obviously) without splitting on the delimiter.
I know I can use the 'tm' package to build a document-term matrix, but it seems like far too much for such simple tokenization. Is there a more straightforward way?
The easiest thing that I can think of is to use my cSplit function in conjunction with dcast.data.table, like this:
library(data.table)       # provides dcast.data.table
library(splitstackshape)  # provides cSplit
dcast.data.table(cSplit(data, "tok1", ", ", "long"),
id ~ tok1, value.var = "tok1",
fun.aggregate = length)
# id a b c d e
# 1: 1 1 1 1 0 0
# 2: 2 2 0 0 1 0
# 3: 3 0 1 0 1 1
dcast.data.table(cSplit(data, "tok2", "|", "long"),
id ~ tok2, value.var = "tok2",
fun.aggregate = length)
# id alpha bravo charlie delta tango
# 1: 1 1 1 0 0 0
# 2: 2 1 0 1 0 0
# 3: 3 0 0 0 1 2
Edit: Updated with library(splitstackshape), since cSplit is now part of that package.
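For comparison, the same split-to-long-then-count idea can be sketched in base R alone. This is only an illustration and not part of the answer above: the names parts, long and wide are made up here, and it assumes ", " is the only delimiter used in tok1.

```r
data <- data.frame(id = c(1, 2, 3),
                   tok1 = c("a, b, c", "a, a, d", "b, d, e"),
                   tok2 = c("alpha|bravo", "alpha|charlie", "tango|tango|delta"),
                   stringsAsFactors = FALSE)

# Split each cell into tokens; fixed = TRUE treats the delimiter literally
parts <- strsplit(data$tok1, ", ", fixed = TRUE)

# "Long" form: one row per (id, token) pair
long <- data.frame(id = rep(data$id, lengths(parts)),
                   token = unlist(parts),
                   stringsAsFactors = FALSE)

# Cross-tabulate ids against tokens, then turn the table into a data frame
wide <- as.data.frame.matrix(table(long$id, long$token))

# Recover the id column from the row names of the cross-tabulation
tok1.occurrences <- cbind(id = as.numeric(rownames(wide)), wide)
rownames(tok1.occurrences) <- NULL
tok1.occurrences
```

Swapping in data$tok2 and "|" as the delimiter should produce the second table in the same way.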
If you don't mind using data.table (temporarily), this might work for you:
library(data.table)
data <- data.frame(id=c(1,2,3),
                   tok1=c("a, b, c", "a, a, d", "b, d, e"),
                   tok2=c("alpha|bravo", "alpha|charlie", "tango|tango|delta"))
splitCols <- function(col_name, data) {
  # strsplit needs strings
  data[, col_name] <- as.character(data[, col_name])
  # make a list of single row data frames from the tabulation
  # of each of items from the split column
  tokens <- lapply(strsplit(data[, col_name], "[^[:alnum:]]+"), function(x) {
    tab <- table(x)
    setNames(rbind.data.frame(as.numeric(tab)), names(tab))
  })
  # use data.table's rbindlist, filling in missing values
  rbl <- rbindlist(tokens, fill=TRUE)
  # 0 out the NA's
  rbl[is.na(rbl)] <- 0
  # add the "id" column
  cbind(id=data$id, rbl)
}
lapply(names(data)[-1], splitCols, data)
## [[1]]
## id a b c d e
## 1: 1 1 1 1 0 0
## 2: 2 2 0 0 1 0
## 3: 3 0 1 0 1 1
##
## [[2]]
## id alpha bravo charlie delta tango
## 1: 1 1 1 0 0 0
## 2: 2 1 0 1 0 0
## 3: 3 0 0 0 1 2
You end up with a list of data frames that you can then process as you see fit.
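One thing worth keeping in mind with the "[^[:alnum:]]+" pattern used above: it splits on any run of non-alphanumeric characters, not just on the column's delimiter. That is what lets one function handle both ", " and "|", but it would also break up tokens that contain spaces or hyphens. A small sketch with a made-up string (x is hypothetical, not from the question's data):

```r
# A made-up value whose tokens contain a space and a hyphen
x <- "new york|los-angeles"

# Splitting on any non-alphanumeric run also cuts inside the tokens
by_class <- strsplit(x, "[^[:alnum:]]+")[[1]]
by_class
# [1] "new"     "york"    "los"     "angeles"

# Splitting on the literal delimiter keeps each token whole
by_delim <- strsplit(x, "|", fixed = TRUE)[[1]]
by_delim
# [1] "new york"    "los-angeles"
```

For the data in the question this makes no difference, since every token is purely alphanumeric.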
You could use the stringr package as follows:
require(stringr)
test_data <- data.frame(id=c(1,2,3), tok1=c("a, b, c", "a, a, d", "b, d, e"), tok2=c("alpha|bravo", "alpha|charlie", "tango|tango|delta") )
# Convert to character and use "," as a uniform delimiter
test_data$tok1<-as.character(test_data$tok1)
test_data$tok1<-gsub(" ","",test_data$tok1)
test_data$tok2=gsub("\\|",",",as.character(test_data$tok2))
#Unique list of elements for each column
tok1.uniq=sort(unique(unlist(strsplit(as.character(test_data$tok1),","))))
tok2.uniq=sort(unique(unlist(strsplit(as.character(test_data$tok2),","))))
# Token count for each column
# For each row, count how often each unique token occurs, using str_count from the stringr package
Column one:
tok1.occurances=do.call(cbind,lapply(tok1.uniq,function(x) {
DF=data.frame(do.call(rbind,lapply(test_data$tok1,function(y,z=x) str_count(y,z))))
colnames(DF) = x
return(DF)
}
))
#Add ID number as column
tok1.occurances=data.frame(id=as.numeric(row.names(tok1.occurances)),tok1.occurances,stringsAsFactors=FALSE)
# > tok1.occurances
# id a b c d e
# 1 1 1 1 0 0
# 2 2 0 0 1 0
# 3 0 1 0 1 1
Column two:
tok2.occurances=do.call(cbind,lapply(tok2.uniq,function(x) {
DF=data.frame(do.call(rbind,lapply(test_data$tok2,function(y,z=x) str_count(y,z))))
colnames(DF) = x
return(DF)
}
))
tok2.occurances=data.frame(id=as.numeric(row.names(tok2.occurances)),tok2.occurances,stringsAsFactors=FALSE)
# > tok2.occurances
# id alpha bravo charlie delta tango
# 1 1 1 0 0 0
# 2 1 0 1 0 0
# 3 0 0 0 1 2
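A caveat on this approach: str_count counts pattern matches anywhere in the string, so it only gives token counts when no token appears inside another one. That holds for the data in the question, but not in general. The sketch below uses base R and a made-up vector (tok, substring_hits and exact are hypothetical names) to show the difference between counting substrings and counting whole tokens:

```r
# Made-up data in which the token "a" also occurs inside the token "alpha"
tok <- c("a,alpha", "alpha,alpha", "a,a")

# Substring counting: every "a" is counted, including those inside "alpha"
substring_hits <- lengths(gregexpr("a", tok, fixed = TRUE))
substring_hits
# [1] 3 4 2

# Exact-token counting: split first, then compare whole tokens
parts <- strsplit(tok, ",", fixed = TRUE)
uniq <- sort(unique(unlist(parts)))
exact <- sapply(uniq, function(u) {
  vapply(parts, function(p) sum(p == u), numeric(1))
})
exact
#      a alpha
# [1,] 1     1
# [2,] 0     2
# [3,] 2     0
```

Splitting before counting avoids the overlap problem at the cost of an extra strsplit step.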