Removing double quotes that bound each row in an R data.table

Question

I have several improperly formatted CSVs that are tab-separated but have a double quote bounding each row. I can read them in and ignore the " with:

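library(data.table)
# pattern is a regular expression, not a glob
files = list.files(pattern="\\.csv$")
# quote="" makes fread treat " as an ordinary character
dt = lapply(files, fread, sep="\t", quote="")
setattr(dt, 'names', sub("\\.csv$", "", files))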

but is there an R data.table way of handling the quotes, beyond separate commands to strip them from the first and last columns?

# sample table
DT = data.table(V1=paste0("\"", 1:5), V2=c(1,2,5,6,8), 
                V3=c("a\"","b\"","c\"","d\"","e\""))
dt = list(DT, DT, DT)

# these work but aren't using data.table 
dt = lapply(dt, function(i) {
  i[[1]] = gsub('"', '', i[[1]])
  i[[ncol(i)]] = gsub('"', '', i[[ncol(i)]])
  i
})

# magical mystery operation that doesn't work???
dt = lapply(dt, function(i){
    i[, .SD := gsub('"', '', rep(.SD)), .SDcols=names(i)[c(1, ncol(i))]]
})

Answer 1

Use either the index or the column names to assign:

library(data.table)
lapply(dt, \(x) {
   # get the names of the first and last columns by index
   nm1 <- names(x)[c(1, length(x))]
   # loop over the Subset of Data.table (.SD), restricted to the columns
   # named in .SDcols, and apply `gsub` to each one;
   # assign the result back to the columns of interest (nm1) by reference
   x[, (nm1) := lapply(.SD, gsub, pattern = '"', replacement = ''), 
          .SDcols = nm1][]
 })

-output

[[1]]
       V1    V2     V3
   <char> <num> <char>
1:      1     1      a
2:      2     2      b
3:      3     5      c
4:      4     6      d
5:      5     8      e

[[2]]
       V1    V2     V3
   <char> <num> <char>
1:      1     1      a
2:      2     2      b
3:      3     5      c
4:      4     6      d
5:      5     8      e

[[3]]
       V1    V2     V3
   <char> <num> <char>
1:      1     1      a
2:      2     2      b
3:      3     5      c
4:      4     6      d
5:      5     8      e
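
In the output above, V1 is still <char> even though it now holds only digits. If a numeric column is wanted, one possible follow-up is to pass the cleaned values through base R's type.convert(). This is only a sketch on the question's sample table, not part of the original answer; the anonymous helper and the fixed = TRUE argument are additions made here.

```r
library(data.table)

# Sample table from the question: quotes stuck to the first and last columns
DT <- data.table(V1 = paste0("\"", 1:5), V2 = c(1, 2, 5, 6, 8),
                 V3 = paste0(letters[1:5], "\""))

# Names of the first and last columns
nm1 <- names(DT)[c(1, length(DT))]

# Strip the quotes, then let type.convert() re-guess each column's type;
# as.is = TRUE keeps non-numeric columns as character instead of factor
DT[, (nm1) := lapply(.SD, function(col) {
  type.convert(gsub('"', '', col, fixed = TRUE), as.is = TRUE)
}), .SDcols = nm1]

str(DT)
```

Because the whole column is replaced, := is free to change its type here; V3 contains letters, so it stays character.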

Another option is set, which assigns by reference:

invisible(lapply(dt, \(x) {
   nm1 <- names(x)[c(1, length(x))]
   # set() updates each column by reference, so dt itself is modified
   for(j in nm1) set(x, i = NULL, j = j, value = gsub('"', '', x[[j]]))
 }))

-output

dt
[[1]]
       V1    V2     V3
   <char> <num> <char>
1:      1     1      a
2:      2     2      b
3:      3     5      c
4:      4     6      d
5:      5     8      e

[[2]]
       V1    V2     V3
   <char> <num> <char>
1:      1     1      a
2:      2     2      b
3:      3     5      c
4:      4     6      d
5:      5     8      e

[[3]]
       V1    V2     V3
   <char> <num> <char>
1:      1     1      a
2:      2     2      b
3:      3     5      c
4:      4     6      d
5:      5     8      e
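
A different route, not taken in the answer above, is to remove the bounding quotes before parsing, so that fread never sees them and can infer the column types directly. The sketch below assumes the files have no header row and that every line starts and ends with exactly one double quote. The temporary file and its contents are made up for illustration.

```r
library(data.table)

# Hypothetical malformed file: tab-separated, each row wrapped in double quotes
tmp <- tempfile(fileext = ".csv")
writeLines(c('"1\t1\ta"', '"2\t2\tb"', '"3\t5\tc"'), tmp)

# Drop the leading and trailing quote of every line, then parse the cleaned text
lines <- gsub('^"|"$', '', readLines(tmp))
DT <- fread(text = lines, sep = "\t", header = FALSE)

str(DT)
unlink(tmp)
```

For a list of files, the same two steps could be wrapped in a function and passed to lapply(files, ...) in place of the fread call from the question. The trade-off is that each file is read into memory as text first.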

Solution 1
1 accepted 2022-02-25 20:01:59
