How to Read Specific Lines of a Large Irregular Data File in R

I work in data management, meaning people give me raw data and I have to format and parse it to get the pieces I need and organize it in a way that makes sense. Currently the data I'm working with is a log file, but I have opened and saved it as a text file. It looks a bit like this:

M 20160525 09:51:11.822 DOC1: Clearing stale DENIED send to 1864130A.62274 in 13 after 39411ms

D 20160525 09:51:11.824 F798257E GET 10.19.100.24:62274 van8tc - "/pcgc/public/Other/" "*li" Done

M 20160525 09:51:11.825 DOC1: F798257E Transaction has been acknowledged at 15804727

F 20160525 09:51:11.825 F798257E GET 10.19.100.24:62274 van8tc - "/pcgc/public/Other/" "*li" 441 (0,0) "0.10 seconds (36.8 kilobits/sec)"

D 20160525 09:51:11.825 F798257E GET 10.19.100.24:62274 van8tc - "/pcgc/public/Other/" "*li" - "Freeing Package Unit"

It's quite a large file, and I don't wish to import the entire thing into R, mainly because of the amount of space it takes up. Each line has "fields" (what I want to organize and separate) that are designated as follows:

  1. F -- identifier of the line
  2. 20160525 -- date (yyyymmdd)
  3. 17:52:38.791 -- timestamp (HH:MM:SS.sss)
  4. F798259D -- transfer identifier
  5. 156.145.15.85:46634 -- IP address and related port
  6. xqixh8sl -- username
  7. AES -- encryption level (could be - (dash))
  8. "/pcgc...fastq.gz" -- transferred file (in ")
  9. "" -- additional string (should be empty "")
  10. 2951144113 -- transferred bytes
  11. (0,0) -- error
  12. "2289.47 seconds (10.3 megabits/sec)" -- data about the transfer

The only lines I need are the ones that start with F and have a (0,0) error. Here is an example line:

F 20160525 17:52:38.791 F798259D GET 156.145.15.85:46634 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0053681_HS_EX__1-02598__v1_FCAD18P7ACXX_L8_p92of93_P1.fastq.gz" "" 2951144113 (0,0) "2289.47 seconds (10.3 megabits/sec)"

And I would NOT consider a line like this:

F 20160602 14:15:48.398 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/public/Other/exome/fastq/PCGC0065109_HS_EX__1-04692__v3_FCAD2HMUACXX_L4_p1of1_P2.fastq.gz" "" 50725464 (4,32) "Remote Application: Session Aborted: Aborted by user interrupt"

The line above does not have a (0,0) error, so it would not be considered.
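As a minimal sketch, the rule above (the line starts with F and carries a (0,0) error) can be written as a single regular expression and tested with grepl(). The names `sample_lines`, `is_wanted` and `wanted` are made up for illustration, and the file paths in the sample are shortened placeholders. Note that in R's regular expressions the parentheses have to be escaped to match literally.

```r
# Hypothetical sample: one good F line, one failed F line, one M line
# (file paths shortened to placeholders for readability).
sample_lines <- c(
  'F 20160525 17:52:38.791 F798259D GET 156.145.15.85:46634 xqixh8sl AES "/pcgc/a.fastq.gz" "" 2951144113 (0,0) "2289.47 seconds (10.3 megabits/sec)"',
  'F 20160602 14:15:48.398 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/b.fastq.gz" "" 50725464 (4,32) "Remote Application: Session Aborted: Aborted by user interrupt"',
  'M 20160525 09:51:11.825 DOC1: F798257E Transaction has been acknowledged at 15804727'
)

# "^F " anchors the line identifier at the start of the line;
# "\\(0,0\\)" matches the literal error code later in the line.
is_wanted <- grepl("^F .*\\(0,0\\)", sample_lines)
wanted <- sample_lines[is_wanted]
```

Here `is_wanted` is a logical vector with one entry per line, so it can be used directly to subset whatever vector of lines has been read.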

My question is this: since the file is so large, I want to be able to parse through it and pick out only the lines I need beforehand. Then, once I import it, I want the best way to organize it neatly. I know that there are a variety of ways to read the file (I have been trying with readLines() and scan()), but I don't know how to write the conditional statement (the line must start with F and must have a (0,0) error).

I have tried a variety of things:

  1. Used scan() to import the entire file into R as a list.

    x <- scan("dataSet.txt", what = list(lineID = "", date = "", timestamp = "", transferID = "", IP = "", username = "", encryption = "", transferredFile = "", error = "", data = ""), sep = " ", fill = TRUE, strip.white = TRUE)

    logs <- list(x)

    logs

While I liked the numbering and rows, it left out a lot of fields that I needed. This is the output it gave me:

[9062] ""
[9063] ""
[9064] ""
[9065] ""
[9066] ""
[9067] ""
[9068] ""
[9069] ""
[9070] ""
[9071] ""
[9072] ""
[9073] "Mnr:0"
[9074] ""
[9075] "Mnr:0"
[9076] ""
[9077] ""
[9078] "data"
[9079] ""
[9080] "2,"
[9081] "12,"
[9082] ""
[9083] ""
[9084] "550F919C.60099"

  2. I found an example of this online, so I copied it and tried to use it in the same way. However, it did not give me what I wanted, and the way I used it also imported the entire file. If someone could explain how this works, that would also be greatly appreciated.

> setwd("/Users/kimm5w/Intern Work")

> dataset <- list()

> con <- file("dataSet.txt")

> open(con)

> dataset <- grep("F", scan("dataSet.txt", what = list(lineID = "", date = "", timestamp = "", transferID = "", IP = "", username = "", encryption = "", transferredFile = "", error = "", data = ""), sep = " ", fill = TRUE, strip.white = TRUE), perl = TRUE, value = TRUE)

> dataset

This is the output it gave me, which was not the format I wanted:

\\"[0]\\", \\"\\", \\"xqixh8sl:\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"Mnr:0\\", \\"\\", \\"Mnr:0\\", \\"\\", \\"\\", \\"data\\", \\"\\", \\"at\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"Mnr:0\\", \\"\\", \\"Mnr:0\\", \\"\\", \\"\\", \\"data\\", \\"\\", \\"at\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"Mnr:0\\", \\"\\", \\"Mnr:0\\", \\"\\", \\"\\", \\"data\\", \\"\\", \\"550F919C.36474\\", \\"\\", \\"550F919C.42385\\", \\"\\", \\"550F919C.49879\\", \\"\\", \\"550F919C.53923\\", \\"\\", \\"6,\\", \\"18,\\", \\"\\", \\"550F919C.36773\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"at\\", \\"\\", \\"\\", \\"\\", \\"\\", \\"\\", \\n\\"\\", \\"\\", \\"550F919C.37525\\", \\"\\", \\"6,\\", \\"18,\\", \\"\\")"

I'm fairly new to R; I learned Java, and though the concepts are similar, the syntax is unfamiliar. If anyone can help me with this, please do! I've been working on this for about a week and can't figure it out. Thank you for your help!

UPDATE

Here's what I've tried so far after going through your suggestions:

    setwd("/Users/kimm5w/Intern Work")
    df<-data.frame(readLines("dataSet.txt"))
    F_dataSet <- grep("^F.*(0,0)", "dataSet.txt")
    F_dataSet

    library(stringr)
    x = 0
    while(x < length(readLines("dataSet.txt"))){
      line <- readLines("dataSet.txt")
      if (str_sub(line, 1, 1) == 'F' & grepl('\\(0\\,0\\)', line)[1]){
        F_data <- c(F_data, line)
        }
    }
    display(F_data)

For some reason, when I try to run it in R, it doesn't display the result. However, it does run without error. My question is whether one of these will work. I can't use a for loop because the exact number of lines isn't known, so I tried using a while loop in the second version instead. The link was helpful, but a bit confusing because I wasn't familiar with the syntax. If someone could explain each section, I think it would be easier to understand. On the first attempt, I just tried using grep() to sort out the lines I needed, but I'm not sure if it worked. If anyone can help out from here, that would be very much appreciated. And to those who sent me answers, thank you too. This has helped me a lot, and it is the most progress I've made in a while.

Here's another update. It runs fine, but for some reason the while loop does not print anything. F_data does not show up when I try to display it. Could someone point out where the error is?

    setwd("/Users/kimm5w/Intern Work")
    F_data <- data.frame
    print(F_data)
    library(stringr)
    x <- length(readLines("dataSet.txt"))
    print(x)
    while(x != 0)
      {
      line <- readline("dataSet.txt")
      print(line)
      if (str_sub(line, 1, 1) == 'F' & grepl('\\(0\\,0\\)', line)[1]){
        F_data <- c(F_data, line)
        print(F_data)
      }
      x <- x + 1
    }
    close(con)
    F_data

Perhaps this is a cop-out, but if you are concerned about conserving memory during your R session, just don't do the filtering in the R session. You can preprocess the file with grep before reading it into R:

grep "^F.*(0,0)" dataSet.txt > processed_dataSet.txt
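Once the filtered file exists, it can be read into a data frame in one call. The following is a sketch, not something verified against the real log: it assumes every kept line has the 13 whitespace-separated tokens seen in the example F line (the 12 listed fields plus the GET token between the transfer identifier and the IP address), and the column names are invented here. So that it runs on its own, it writes one sample line (with a shortened placeholder path) to a temporary file standing in for processed_dataSet.txt.

```r
# Stand-in for processed_dataSet.txt so the sketch is self-contained.
processed_file <- tempfile(fileext = ".txt")
writeLines(
  'F 20160525 17:52:38.791 F798259D GET 156.145.15.85:46634 xqixh8sl AES "/pcgc/a.fastq.gz" "" 2951144113 (0,0) "2289.47 seconds (10.3 megabits/sec)"',
  processed_file
)

# Column names are assumptions made for this example, not part of the log format.
field_names <- c("lineID", "date", "timestamp", "transferID", "operation",
                 "address", "username", "encryption", "file", "extra",
                 "bytes", "error", "info")

transfers <- read.table(
  processed_file,
  sep = "",                  # any run of whitespace separates fields
  quote = "\"",              # double quotes keep spaces inside one field
  comment.char = "",         # do not treat '#' as the start of a comment
  col.names = field_names,
  colClasses = "character",  # keep IDs, dates and timestamps as text
  stringsAsFactors = FALSE
)

# Byte counts can exceed R's integer range, so store them as doubles.
transfers$bytes <- as.numeric(transfers$bytes)
```

Reading everything as character first avoids surprises such as the date being turned into a number; individual columns can then be converted as needed.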

Let's say you read the first line using the readLines function and a for loop or something else. Then you can use a simple search to see whether your line starts with "F" and contains "(0,0)". For instance:

library(stringr)
relevant_guys <- character(0)  # collects the matching lines
line <- 'F 20160525 09:51:11.825 F798257E GET 10.19.100.24:62274 van8tc - "/pcgc/public/Other/" "*li" 441 (0,0) "0.10 seconds (36.8 kilobits/sec)" D 20160525 09:51:11.825 F798257E GET 10.19.100.24:62274 van8tc - "/pcgc/public/Other/" "*li" - "Freeing Package Unit"'

# Keep the line if it starts with "F" and contains the literal "(0,0)".
if (str_sub(line, 1, 1) == 'F' && grepl('\\(0,0\\)', line)) {
    relevant_guys <- c(relevant_guys, line)
}

This way you don't have to hold the whole file in memory; you evaluate it line by line.
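To make "line by line" concrete: readLines() on an open connection continues from where the previous call stopped, so the file can be walked in blocks without knowing its length in advance. What follows is a sketch under assumptions: the function name `read_matching_lines` and its `chunk_size` default are invented for illustration, and a small temporary file (with one shortened placeholder path) stands in for dataSet.txt.

```r
read_matching_lines <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))                 # release the connection even on error
  kept <- character(0)
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break    # nothing left to read: end of file
    # Vectorised form of the test: starts with "F " and contains "(0,0)".
    ok <- startsWith(chunk, "F ") & grepl("(0,0)", chunk, fixed = TRUE)
    kept <- c(kept, chunk[ok])
  }
  kept
}

# Tiny stand-in for dataSet.txt.
demo_file <- tempfile(fileext = ".txt")
writeLines(c(
  'M 20160525 09:51:11.822 DOC1: Clearing stale DENIED send to 1864130A.62274 in 13 after 39411ms',
  'F 20160525 09:51:11.825 F798257E GET 10.19.100.24:62274 van8tc - "/pcgc/public/Other/" "*li" 441 (0,0) "0.10 seconds (36.8 kilobits/sec)"',
  'F 20160602 14:15:48.398 F7982D62 GET 156.145.15.85:36773 xqixh8sl AES "/pcgc/b.fastq.gz" "" 50725464 (4,32) "Remote Application: Session Aborted: Aborted by user interrupt"'
), demo_file)

# A chunk size of 2 forces more than one pass over the 3-line demo file.
f_lines <- read_matching_lines(demo_file, chunk_size = 2L)
```

Only one chunk plus the lines already kept are in memory at any time. Growing `kept` with c() is acceptable when matches are a small share of the file; if most lines match, collecting the chunks in a list and calling unlist() once at the end would likely be cheaper.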
