Splitting a text column into a ragged set of new columns in a data table in R

Question

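I have a data table with 20,000+ rows and one column. The string in each row contains a different number of words. I want to split the words and put each one into a new column. I know how to do it one word at a time: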

Data [ , Word1 := as.character(lapply(strsplit(as.character(Data$complaint), split=" "), "[", 1))]

(Data is my data table, and complaint is the name of the column.)

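Obviously this is inefficient, because each cell holds a different number of words and I would have to repeat the call for every word position.

Can you suggest a more efficient way to do this?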

Answer 1

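Check out cSplit from my "splitstackshape" package. It works on either a data.frame or a data.table (but always returns a data.table).

Assuming KFB's sample data (a data frame df with a single text column x) is at least somewhat representative of your actual data, you can try: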

library(splitstackshape)
cSplit(df, "x", " ")
#     x_1      x_2         x_3 x_4
# 1: This       is interesting  NA
# 2: This actually          is not

Another option is stri_split_fixed with simplify = TRUE (from "stringi"), which is apparently slated to enter the "splitstackshape" code soon:

library(stringi)
stri_split_fixed(df$x, " ", simplify = TRUE)
#      [,1]   [,2]       [,3]          [,4] 
# [1,] "This" "is"       "interesting" NA   
# [2,] "This" "actually" "is"          "not"

Answer 2

As of data.table version 1.9.6 on CRAN, two functions, transpose() and tstrsplit(), are available.

With them we can do:

require(data.table)
setDT(tstrsplit(as.character(df$x), " ", fixed=TRUE))[]
#      V1       V2          V3  V4
# 1: This       is interesting  NA
# 2: This actually          is not

tstrsplit is a wrapper for transpose(strsplit(...)).
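
Because tstrsplit returns a list with one element per word position, its result can also be assigned straight back into the table as new columns. The following is a minimal sketch, not part of the original answer: the two-row dt and the Word1, Word2, ... column names are made up for illustration, and it assumes data.table 1.9.6 or later.

```r
library(data.table)

# Hypothetical stand-in for the asker's table: one text column named complaint.
dt <- data.table(complaint = c("This is interesting", "This actually is not"))

# One list element per word position; shorter rows are padded with NA.
parts <- tstrsplit(dt$complaint, " ", fixed = TRUE)

# Illustrative column names (Word1, Word2, ...), added by reference.
cols <- paste0("Word", seq_along(parts))
dt[, (cols) := parts]
```

The original complaint column is kept, so the split can be checked against its source.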

Answer 3

Here is a solution based on rbind.fill.matrix(...) from the plyr package. On a dataset with 20,000 rows it runs in about 3.6 seconds.

# create a sample dataset - you have this already
library(data.table)
words <- LETTERS[1:10]     # "words" are just letters in this example
set.seed(1)                # for reproducible example
w  <- sapply(1:2e4,function(i)paste(words[sample(1:10,sample(1:10,1))],collapse=" "))
dt <- data.table(complaint=w)   # column named complaint, as in the question
head(dt)
#          complaint
# 1:           D F H
# 2:           I J F
# 3:   A B I E C D H
# 4: J D G H B I A E
# 5:         A D G C
# 6:       F E B J I

# you start here...
library(plyr)
# turn each row's words into a 1-row matrix, then stack them, filling gaps with NA
result <- rbind.fill.matrix(lapply(strsplit(dt$complaint, split=" "),matrix,nrow=1))
result <- as.data.table(result)
head(result)
#    1 2 3  4  5  6  7  8  9 10
# 1: D F H NA NA NA NA NA NA NA
# 2: I J F NA NA NA NA NA NA NA
# 3: A B I  E  C  D  H NA NA NA
# 4: J D G  H  B  I  A  E NA NA
# 5: A D G  C NA NA NA NA NA NA
# 6: F E B  J  I NA NA NA NA NA

Edit: added some benchmarks based on @Ananda's comment.

f.rfm    <- function() as.data.table(rbind.fill.matrix(lapply(strsplit(dt$complaint, split=" "),matrix,nr=1)))
library(splitstackshape)
f.csplit <- function() cSplit(dt, "complaint", " ",type.convert=FALSE)
library(stringi)
f.sl2m   <- function() as.data.table(stri_list2matrix(strsplit(dt$complaint, split=" "), byrow = TRUE))
f.ssf    <- function() as.data.table(stri_split_fixed(dt$complaint, " ", simplify = TRUE))

all.equal(f.rfm(),f.csplit(),check.names=FALSE)
# [1] TRUE
all.equal(f.rfm(),f.sl2m(),check.names=FALSE)
# [1] TRUE
all.equal(f.rfm(),f.ssf(),check.names=FALSE)
# [1] TRUE
library(microbenchmark)
microbenchmark(f.rfm(),f.csplit(),f.sl2m(),f.ssf(),times=10)
# Unit: milliseconds
#        expr        min         lq     median        uq        max neval
#     f.rfm() 3566.17724 3589.31203 3606.93303 3665.4087 3719.32299    10
#  f.csplit()   98.05709  102.46456  104.51046  107.9588  117.26945    10
#    f.sl2m()   55.45527   55.58852   56.75406   58.9347   67.44523    10
#     f.ssf()   17.77499   17.98879   18.30831   18.4537   21.62161    10

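So it looks like stri_split_fixed(...) is the winner: its median time is about 18 ms, against about 57 ms for stri_list2matrix, 105 ms for cSplit, and 3.6 s for rbind.fill.matrix.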

Answer 4

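Some sample data would be nice, but if I understand what you want, it cannot be done cleanly in a data frame. Because each row has a different number of words, you will need a list. Even so, splitting the words across the whole object is very simple.

If you run strsplit(as.character(Data[,1]), " ") you will get a list in which each element corresponds to one row of the data frame (for a data.table, use Data[[1]] to extract the column as a vector). There are several ways to rearrange this list from there, but the best one depends on your goal.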
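
One possible rearrangement, sketched here in base R as an illustration rather than as the answerer's method, is a long table with one row per word. The two-row Data and the column names row, position and word are assumptions made for the example.

```r
# Hypothetical stand-in for the asker's data: one text column named complaint.
Data <- data.frame(complaint = c("This is interesting", "This actually is not"),
                   stringsAsFactors = FALSE)

# One list element per row, each holding that row's words.
word_list <- strsplit(as.character(Data$complaint), " ", fixed = TRUE)

# Long format: one row per word, keyed by source row and word position.
n_words <- lengths(word_list)
long <- data.frame(
  row      = rep(seq_along(word_list), times = n_words),
  position = sequence(n_words),
  word     = unlist(word_list),
  stringsAsFactors = FALSE
)
```

This layout needs no NA padding, which may suit counting or filtering words better than a wide table does.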

Answer 5

This works for both data.table and data.frame:

# toy data
df <- structure(list(x = structure(c(2L, 1L), .Label = c("This actually is not", 
"This is interesting"), class = "factor")), .Names = "x", row.names = c(NA, 
-2L), class = "data.frame")

#                      x
# 1  This is interesting
# 2 This actually is not

# the code
split_result <- strsplit(as.character(df$x), " ")
length_n <- sapply(split_result, length)
length_max <- seq_len(max(length_n))
as.data.frame(t(sapply(split_result, "[", i = length_max))) # Or as.data.table(...)

#     V1       V2          V3   V4
# 1 This       is interesting <NA>
# 2 This actually          is  not
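
The padding in the code above comes from how "[" behaves in R: indexing a vector past its end returns NA for the missing positions, so every row ends up with the same length. A minimal base R illustration (the variable names are made up for the example):

```r
# A row with three words, indexed at four positions.
words  <- c("This", "is", "interesting")
padded <- words[seq_len(4)]   # the fourth position does not exist, so it is NA
```

sapply then binds these equal-length vectors into a matrix with one column per row of the data, which is why the result is transposed with t().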

5 answers

Answer 1: 10, accepted, 2014-11-13 01:47:12

Answer 2: 10, 2015-01-27 19:21:56

Answer 3: 3, 2014-11-13 00:49:25

Answer 4: 2, 2014-11-13 00:06:23

Answer 5: 2, 2014-11-13 00:44:10