Efficient replacement for for-loop when splitting strings in R

I have a large dataframe (20 columns, >100k rows) and need to split a column of character strings into multiple new columns.

The first 3 observations of the column in question are something like this:

scans <- data.frame(scan = c("CT Cervical Sp,CT Head Plain", "II < 1 Hour", 
                 "L-S Spine,L-S Spine"))

which looks like this:

                          scan
1 CT Cervical Sp,CT Head Plain
2                  II < 1 Hour
3          L-S Spine,L-S Spine

I need to split this into 5 columns (each observation contains at most 5 substrings), and for observations with fewer substrings I want the remaining columns filled with NAs. I am currently using this code:

scans <- data.frame(scan = c("CT Cervical Sp,CT Head Plain", "II < 1 Hour",
"L-S Spine,L-S Spine"))

for(i in 1:nrow(scans)){
  scans$scan1[i] <- strsplit(scans$scan, ",")[[i]][1]
  scans$scan2[i] <- strsplit(scans$scan, ",")[[i]][2]
  scans$scan3[i] <- strsplit(scans$scan, ",")[[i]][3]
  scans$scan4[i] <- strsplit(scans$scan, ",")[[i]][4]
  scans$scan5[i] <- strsplit(scans$scan, ",")[[i]][5]
}

which works and outputs my desired solution:

                          scan          scan1         scan2 scan3 scan4 scan5
1 CT Cervical Sp,CT Head Plain CT Cervical Sp CT Head Plain    NA    NA    NA
2                  II < 1 Hour    II < 1 Hour            NA    NA    NA    NA
3          L-S Spine,L-S Spine      L-S Spine     L-S Spine    NA    NA    NA

... but it is really slow. Looping over tens or hundreds of thousands of observations is time-consuming.

Many thanks for any advice.

Another way is to use tstrsplit in the devel version of data.table:

library(data.table) # v >= 1.9.5
setDT(scans)[, tstrsplit(scan, ",", fixed = TRUE)]
#                V1            V2
# 1: CT Cervical Sp CT Head Plain
# 2:    II < 1 Hour            NA
# 3:      L-S Spine     L-S Spine 

If you are sure that at least one row splits into 5 substrings, you can easily create these columns by reference:

setDT(scans)[, paste0("scan", 1:5) := tstrsplit(scan, ",")]

Alternatively, the tidyr package offers similar functionality:

library(tidyr)
separate(scans, scan, paste0("scan", 1:2), ",", extra = "merge", remove = FALSE)
#                           scan          scan1         scan2
# 1 CT Cervical Sp,CT Head Plain CT Cervical Sp CT Head Plain
# 2                  II < 1 Hour    II < 1 Hour          <NA>
# 3          L-S Spine,L-S Spine      L-S Spine     L-S Spine

Or another option using only base R:

cbind(scans, read.table(text = as.character(scans$scan), sep = ",", fill = TRUE, na.strings = ""))
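
If the result must always have exactly five columns named scan1 to scan5, one further base R possibility is to split the whole column once and pad every row with NA. This is only a minimal sketch, not a benchmarked solution: the names parts and padded are made up for illustration, and stringsAsFactors = FALSE is an assumption added so that it behaves the same on R versions before 4.0. It also sidesteps a property of read.table, which with fill = TRUE infers the number of columns from the first five lines of input unless col.names is supplied.

```r
# Example data from the question (stringsAsFactors = FALSE is an assumption
# so the column stays character on older R versions)
scans <- data.frame(scan = c("CT Cervical Sp,CT Head Plain", "II < 1 Hour",
                             "L-S Spine,L-S Spine"),
                    stringsAsFactors = FALSE)

# Split every string exactly once, outside of any loop
parts <- strsplit(scans$scan, ",", fixed = TRUE)

# Indexing 1:5 pads short vectors with NA; vapply returns a 5 x n matrix,
# so transpose it to get one row per observation
padded <- t(vapply(parts, function(p) p[1:5], character(5)))
colnames(padded) <- paste0("scan", 1:5)

# Attach the five new columns to the original data
result <- cbind(scans, padded, stringsAsFactors = FALSE)
```

Indexing a character vector past its end yields NA, which is what fills the shorter rows. Note that p[1:5] would also silently drop anything beyond a fifth substring, so this relies on the stated maximum of 5 per observation.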

You can use:

library(splitstackshape)
cSplit(scans, colnames(scans), sep=',')

#           scan_1        scan_2
#1: CT Cervical Sp CT Head Plain
#2:    II < 1 Hour            NA
#3:      L-S Spine     L-S Spine

Beware that the object returned is a data.table. You can convert it to a data.frame if needed. Here there are only two columns because no value in the data contains more than one comma. If you apply it to data in which some cells contain 4 commas, you will get your desired output.

Use the amazing stringi package -- I challenge anyone to find a faster solution.

# this does all the work
result <- as.data.frame(stringi::stri_split_fixed(scans$scan, ",", simplify = TRUE))

This produces as many columns as the largest number of comma-separated pieces in any row, that is, one more than the maximum number of commas.

To get the exact results from the question, rename the columns and convert empty strings to NA:

# rename the columns if you wish
names(result) <- paste0("scan", 1:ncol(result))
# replace "" with NA
result[result==""] <- NA

cbind(scans, result)
##                           scan          scan1         scan2
## 1 CT Cervical Sp,CT Head Plain CT Cervical Sp CT Head Plain
## 2                  II < 1 Hour    II < 1 Hour          <NA>
## 3          L-S Spine,L-S Spine      L-S Spine     L-S Spine
