
String split issue in R preprocessing data for GCalignR

I am trying to read multiple .txt (GC-FID) files, pull out two columns of data (retention time and peak area) and assign them to an object I can eventually pass to GCalignR. Is there a better way to process this data for GCalignR?

Auxiliary functions causing the issue:

''' '''

# non-empty tokens after splitting a row on whitespace
get_nonempty_splits = function(row) {
  s = strsplit(row, "[[:space:]]")[[1]]
  # logical indexing drops empty tokens and also copes with an empty row
  return(s[s != ""])
}

# filenames have .txt, names do not
filenames_to_names = function(x) {
  l = c()
  for (i in 1:length(x)) {
    x1 = strsplit(x, "[.]")[[i]][1]
    l = c(l, x1)
  }
  return(l)
}

# get data row indices
get_data_row_inds = function(df) {
  ind_start = 0
  ind_end = 0
  for (i in 1:length(df)) {
    row = df[i]
    # find start
    if (grepl("----", row)) {
      stopifnot(ind_start == 0)  # assert ind_start not set
      ind_start = i+1
    }
    # find end
    if (i == length(df) && ind_end == 0) {
      ind_end = length(df)
    } else if (grepl("Totals", row) && grepl(":", row)) {
      stopifnot(ind_end == 0)  # assert ind_end not set
      ind_end = i-1
    }
  }
  stopifnot(ind_start != 0)
  stopifnot(ind_end != 0)
  return(ind_start:ind_end)
}

''' '''
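The row search in get_data_row_inds can also be written without an explicit loop. The following is only a sketch under the same assumptions as the original function (one separator line containing "----", and at most one line containing both "Totals" and ":"); the name find_data_rows and the toy report vector are made up for illustration.

```r
# Hypothetical vectorized variant of get_data_row_inds.
find_data_rows = function(lines) {
  lines = as.character(lines)  # accept factor input as well
  start = which(grepl("----", lines, fixed = TRUE))
  totals = which(grepl("Totals", lines, fixed = TRUE) &
                 grepl(":", lines, fixed = TRUE))
  stopifnot(length(start) == 1, length(totals) <= 1)
  first = start + 1
  # fall back to the last line when there is no "Totals" row
  last = if (length(totals) == 1) totals - 1 else length(lines)
  stopifnot(first <= last)
  first:last
}

# Toy report, shortened from the format shown in the question
report = c("Peak RetTime Sig Type Area Area Name",
           "----|-------|---|------|",
           " 1 1.353 1 BB 2.85703 2.453e-5 ? ",
           " 2 1.952 1 BV 4411.39551 0.03787 ? ",
           "Totals : 1.16485e7 ")
inds = find_data_rows(report)
print(report[inds])
```

The as.character call means the function does not care whether the column arrived as a factor or as character.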

''' '''

path_to_raw_data = "/Users/input"
path_to_processed_data = "."
# get paths for all .txt files in the path_to_raw_data directory
# (pattern is a regular expression, so escape the dot and anchor the end)
paths = list.files(path_to_raw_data, full.names = TRUE, pattern = "\\.txt$")
filenames = list.files(path_to_raw_data, full.names = FALSE, pattern = "\\.txt$")
names = filenames_to_names(filenames)  # without .txt extension

# get data from text file
processed_data = list()
df_lengths = c()
for (i in 1:length(paths)) {  # i indexes the raw files
  path = paths[i]
  df = read.delim(path, fileEncoding= 'UTF-16LE', header=TRUE)
  df = df[[1]]
  inds = get_data_row_inds(df)
  df_lengths = c(df_lengths, length(inds))
  times = c()
  areas = c()
  for (j in inds) {  # j indexes the data rows of a raw file
    row = df[j]
    row = get_nonempty_splits(row)
    time = row_to_time(row)
    area = row_to_area(row)
    times = c(times, time)
    areas = c(areas, area)
  }
  pairs = data.frame(time = times, area = areas)
  processed_data[[i]] = pairs
}

''' '''
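The helpers row_to_time and row_to_area are not shown in the post. As a rough sketch of how the inner loop could be vectorized, the example below assumes the column order suggested by the report header ("Peak RetTime Sig Type Area Area Name"): the second token is the retention time, and the first numeric token after the signal column is the area. The function name parse_peak_rows and that layout are assumptions, not something the post confirms.

```r
# Hypothetical parser for the data rows of one report.
parse_peak_rows = function(rows) {
  rows = as.character(rows)  # guard against factor input
  tokens = strsplit(trimws(rows), "[[:space:]]+")
  num_pattern = "^[0-9.]+(e[+-]?[0-9]+)?$"
  # assumed: token 2 is the retention time
  time = vapply(tokens, function(tk) as.numeric(tk[2]), numeric(1))
  # assumed: area is the first numeric token after peak, time, signal
  area = vapply(tokens, function(tk) {
    rest = tk[-(1:3)]
    hit = rest[grepl(num_pattern, rest)]
    as.numeric(hit[1])
  }, numeric(1))
  data.frame(time = time, area = area)
}

# Three rows copied from the dput output in the question
rows = c(" 1 1.353 1 BB 2.85703 2.453e-5 ? ",
         " 5 2.139 2 0.00000 0.00000 NG ",
         " 8 3.315 1 VV S 1.15784e7 99.39858 ? ")
peaks = parse_peak_rows(rows)
print(peaks)
```

Skipping the alphabetic type flags ("BB", "VV S") by looking for the first numeric token avoids relying on a fixed token position, which differs between rows with and without a peak type.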

Getting this error: Error in strsplit(row, "[[:space:]]") : non-character argument

Any advice on how to solve this? Is it the file encoding? processed_data = list() returns nothing...?

Header of input:

'''
dput(head(df))
structure(c(59L, 53L, 45L, 48L, 47L, 52L), .Label = c(
  " Inj Volume : 1 µl",
  " *** End of Report ***",
  " Area Percent Report ",
  " 1 1.353 1 BB 2.85703 2.453e-5 ? ",
  " 2 1.952 1 BV 4411.39551 0.03787 ? ",
  " 3 2.058 1 VV 4693.20215 0.04029 ? ",
  " 4 2.089 1 VV 6614.89502 0.05679 ? ",
  " 5 2.139 2 0.00000 0.00000 NG ",
  " 6 2.452 2 0.00000 0.00000 1, 3-DNB ",
  " 7 3.149 2 0.00000 0.00000 2, 4-DNT ",
  " 8 3.315 1 VV S 1.15784e7 99.39858 ? ",
  " 9 3.347 1 VV S 5169.44629 0.04438 ? ",
  " # [min] %",
  " 10 3.372 1 VV S 2.09449e4 0.17981 ? ",
  " 11 3.466 1 VV S 2535.17432 0.02176 ? ",
  " 12 3.547 1 VB S 2.45685e4 0.21092 ? ",
  " 13 3.602 1 BV T 451.00174 0.00387 ? ",
  " 14 3.686 1 VV T 40.45324 0.00035 ? ",
  " 15 3.734 1 VV T 13.40936 0.00012 ? ",
  " 16 3.819 1 VB T 508.57788 0.00437 ? ",
  " 17 4.119 1 BB 13.01144 0.00011 ? ",
  " 18 4.856 2 0.00000 0.00000 TNT ",
  " 19 4.975 2 0.00000 0.00000 TNB ",
  " 20 5.549 2 0.00000 0.00000 4-Am-DNT ",
  " 21 5.869 2 0.00000 0.00000 RDX ",
  " 22 5.943 2 0.00000 0.00000 2-Am-DNT ",
  " 23 6.516 2 0.00000 0.00000 Tetryl ",
  " 24 11.716 1 BB 1.75858 1.510e-5 ? ",
  " 25 14.243 1 BB 2.55644 2.195e-5 ? ",
  " 26 16.654 1 BB 3.81723 3.277e-5 ? ",
  " 27 18.826 1 BB 2.58369 2.218e-5 ? ",
  " 28 20.800 1 BB 1.51171 1.298e-5 ? ",
  " 29 24.159 1 BB 1.78975 1.536e-5 ? ",
  " 30 24.269 1 BB 1.81180 1.555e-5 ? ",
  " 31 25.053 1 BB 2.96617 2.546e-5 ? ",
  " 32 25.658 1 BB 6.15337 5.283e-5 ? ",
  " 33 25.809 1 BB 3.89435 3.343e-5 ? ",
  " 34 26.577 1 BB 4.02199 3.453e-5 ? ",
  " 35 26.885 1 BB 2.48416 2.133e-5 ? ",
  " 36 27.219 1 BB 14.88012 0.00013 ? ",
  " 37 27.465 1 BB 3.59732 3.088e-5 ? ",
  " 38 29.377 1 BB 18.55422 0.00016 ? ",
  " 39 32.554 1 BB 17.15620 0.00015 ? ",
  "----|-------|---|------|----------|--------|-------------------------",
  "=====================================================================",
  "2 Warnings or Errors :",
  "Acq. Instrument : Instrument 1 Location : Vial 11",
  "Acq. Operator : HHV Seq. Line : 2",
  "Calib. Data Modified : Tuesday, March 12, 2019 6:13:25 PM",
  "Dilution : 1.0000",
  "Do not use Multiplier & Dilution Factor with ISTDs",
  "Injection Date : 24-Feb-20, 14:37:34 Inj : 1",
  "Instrument 1 2/24/2020 3:13:35 PM HHV",
  "Last changed : 2/6/2020 12:59:45 PM by HHV",
  "Method : C:\\Chem32\\1\\DATA\\IPOULIN\\VOC_TEST_1 2020-02-24 13-49-15\\VOC_TEST_HV.M",
  "Method Info : VOC",
  "Multiplier : 1.0000",
  "Peak RetTime Sig Type Area Area Name",
  "Sample Name: P1U1 hex 022420",
  "Sequence File : C:\\Chem32\\1\\DATA\\IPOULIN\\VOC_TEST_1 2020-02-24 13-49-15\\VOC_TEST_1.S",
  "Signal 1: FID1 B, ",
  "Sorted By : Retention Time",
  "Totals : 1.16485e7 ",
  "Warning : Calibrated compound(s) not found",
  "Warning : Calibration warnings (see calibration table listing)"
), class = "factor")
'''

Solved by adding stringsAsFactors = FALSE to the read.delim call: df = read.delim(path, fileEncoding = 'UTF-16LE', header = TRUE, stringsAsFactors = FALSE). Without it, the column was read in as a factor (note class = "factor" in the dput output above), and strsplit only accepts character vectors, hence the "non-character argument" error. The file encoding was not the problem. Thank you.
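To illustrate why the factor matters, here is a minimal sketch with a made-up two-row factor standing in for the report column. It reproduces the error message from the question and shows that converting with as.character is an alternative to setting stringsAsFactors = FALSE when reading the file.

```r
# Toy stand-in for the report column, stored as a factor
# (the default for text columns in read.delim before R 4.0.0).
rows_factor = factor(c(" 1 1.353 1 BB 2.85703 ", "Totals : 1.16485e7 "))

# strsplit() only accepts character vectors, so a factor fails.
result = tryCatch(
  strsplit(rows_factor[1], "[[:space:]]"),
  error = function(e) conditionMessage(e)
)
print(result)

# Converting to character gives strsplit() what it expects.
rows_chr = as.character(rows_factor)
tokens = strsplit(rows_chr[1], "[[:space:]]")[[1]]
tokens = tokens[tokens != ""]
print(tokens)
```

From R 4.0.0 onward stringsAsFactors defaults to FALSE, so the original call would return character data there without the extra argument.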
