Extracting table from text file

I am trying to extract tables from text files and have found several earlier posts here that address similar questions. However, none seem to work efficiently with my problem. The most helpful answer I have found is to one of my earlier questions here: R: removing header, footer and sporadic column headings when reading csv file

An example dummy text file contains:

> 
> 
> ###############################################################################
> 
> # Display AICc Table for the models above
> 
> 
> collect.models(, adjust = FALSE)
      model npar  AICc  DeltaAICc weight  Deviance
13      P1   19    94      0.00     0.78      9
12      P2   21    94      2.64     0.20      9
10      P3   15    94      9.44     0.02      9
2       P4   11    94    619.26     0.00      9
> 
> 
> ###############################################################################
> 
> # the three lines below count the number of errors in the code above
> 
> cat("ERROR COUNT:", .error.count, "\n")
ERROR COUNT: 0 
> options(error = old.error.fun)
> rm(.error.count, old.error.fun, new.error.fun)
> 
> ##########
> 
> 

I have written the following code to extract the desired table:

my.data <- readLines('c:/users/mmiller21/simple R programs/dummy.log')

top    <- '> collect.models\\(, adjust = FALSE)'
bottom <- '> # the three lines below count the number of errors in the code above'

my.data <- my.data[-c(grep(bottom, my.data):length(my.data))]
my.data <- my.data[-c(1:grep(top, my.data))]
my.data <- my.data[c(1:(length(my.data)-4))]
aa      <- as.data.frame(my.data)
aa

write.table(my.data, 'c:/users/mmiller21/simple R programs/dummy.log.extraction.txt', quote = FALSE, col.names = FALSE, row.names = FALSE)
my.data2 <- read.table('c:/users/mmiller21/simple R programs/dummy.log.extraction.txt', header = TRUE, row.names = c(1))
my.data2
   model npar AICc DeltaAICc weight Deviance
13    P1   19   94      0.00   0.78        9
12    P2   21   94      2.64   0.20        9
10    P3   15   94      9.44   0.02        9
2     P4   11   94    619.26   0.00        9

I would prefer to avoid having to write and then read my.data to obtain the desired data frame. Prior to that step the current code returns a vector of strings for my.data:

[1] "      model npar  AICc  DeltaAICc weight  Deviance" "13      P1   19    94      0.00     0.78      9"   
[3] "12      P2   21    94      2.64     0.20      9"    "10      P3   15    94      9.44     0.02      9"   
[5] "2       P4   11    94    619.26     0.00      9"

Is there some way I can convert the above vector of strings into a data frame like that in dummy.log.extraction.txt without writing and then reading my.data?

The line:

aa <- as.data.frame(my.data)

returns the following, which looks like what I want:

#                                              my.data
# 1       model npar  AICc  DeltaAICc weight  Deviance
# 2    13      P1   19    94      0.00     0.78      9
# 3    12      P2   21    94      2.64     0.20      9
# 4    10      P3   15    94      9.44     0.02      9
# 5    2       P4   11    94    619.26     0.00      9

However:

dim(aa)
# [1] 5 1

If I can split aa into columns then I think I will have what I want without having to write and then read my.data.

I found the post: Extracting Data from Text Files. However, in the posted answer the table in question seems to have a fixed number of rows. In my case the number of rows can vary between 1 and 20. Also, I would prefer to use base R. In my case I think the number of rows between bottom and the last row of the table is a constant (here 4).

I also found the post: How to extract data from a text file using R or PowerShell? However, in my case the column widths are not fixed and I do not know how to split the strings (or rows) so there are only seven columns.

Given all of the above perhaps my question is really how to split the object aa into columns. Thank you for any advice or assistance.

EDIT:

The actual logs are produced by a supercomputer and contain up to 90,000 lines. However, the number of lines varies greatly among logs. That is why I was making use of top and bottom.

read.table and its family now have an option to read text:

> df <- read.table(text = paste(my.data, collapse = "\n"))
> df
   model npar AICc DeltaAICc weight Deviance
13    P1   19   94      0.00   0.78        9
12    P2   21   94      2.64   0.20        9
10    P3   15   94      9.44   0.02        9
2     P4   11   94    619.26   0.00        9
> summary(df)
 model       npar           AICc      DeltaAICc          weight         Deviance
 P1:1   Min.   :11.0   Min.   :94   Min.   :  0.00   Min.   :0.000   Min.   :9  
 P2:1   1st Qu.:14.0   1st Qu.:94   1st Qu.:  1.98   1st Qu.:0.015   1st Qu.:9  
 P3:1   Median :17.0   Median :94   Median :  6.04   Median :0.110   Median :9  
 P4:1   Mean   :16.5   Mean   :94   Mean   :157.84   Mean   :0.250   Mean   :9  
        3rd Qu.:19.5   3rd Qu.:94   3rd Qu.:161.90   3rd Qu.:0.345   3rd Qu.:9  
        Max.   :21.0   Max.   :94   Max.   :619.26   Max.   :0.780   Max.   :9  
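For trying this without the log file, here is a self-contained sketch of the same idea. The vector tab_lines is copied from the output shown in the question; passing header, row.names and stringsAsFactors explicitly is an assumption about what is wanted, rather than relying on read.table's auto-detection.

```r
# Character vector as it looks after the top/bottom trimming in the question.
tab_lines <- c(
  "      model npar  AICc  DeltaAICc weight  Deviance",
  "13      P1   19    94      0.00     0.78      9",
  "12      P2   21    94      2.64     0.20      9",
  "10      P3   15    94      9.44     0.02      9",
  "2       P4   11    94    619.26     0.00      9"
)

# text= accepts a character vector directly; each element is read as one line.
tab <- read.table(text = tab_lines, header = TRUE, row.names = 1,
                  stringsAsFactors = FALSE)
str(tab)
```

Note that from R 4.0.0 onward stringsAsFactors defaults to FALSE, so model comes back as character rather than as the factor seen in the summary output above, unless factors are requested.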

Maybe your real log file is completely different and more complex, but with this one you can use read.table directly; you just have to set the right parameters.

data <- read.table("c:/users/mmiller21/simple R programs/dummy.log",
                   comment.char = ">",
                   nrows = 4,
                   skip = 1,
                   header = TRUE,
                   row.names = 1)

str(data)
## 'data.frame':    4 obs. of  6 variables:
##  $ model    : Factor w/ 4 levels "P1","P2","P3",..: 1 2 3 4
##  $ npar     : int  19 21 15 11
##  $ AICc     : int  94 94 94 94
##  $ DeltaAICc: num  0 2.64 9.44 619.26
##  $ weight   : num  0.78 0.2 0.02 0
##  $ Deviance : int  9 9 9 9

data
##    model npar AICc DeltaAICc weight Deviance
## 13    P1   19   94      0.00   0.78        9
## 12    P2   21   94      2.64   0.20        9
## 10    P3   15   94      9.44   0.02        9
## 2     P4   11   94    619.26   0.00        9
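Since the question says the number of table rows varies, skip and nrows could be computed instead of hard-coded. The following is only a sketch under some assumptions: the inline log_lines vector is a shortened stand-in for the real log, the table is assumed to start right after the collect.models call, and the first prompt line (starting with ">") after the header is assumed to mark the end of the table.

```r
# Shortened stand-in for the log; in practice this would come from readLines().
log_lines <- c(
  "> # Display AICc Table for the models above",
  "> collect.models(, adjust = FALSE)",
  "      model npar  AICc  DeltaAICc weight  Deviance",
  "13      P1   19    94      0.00     0.78      9",
  "12      P2   21    94      2.64     0.20      9",
  "10      P3   15    94      9.44     0.02      9",
  "> ",
  "> # the three lines below count the number of errors in the code above",
  "ERROR COUNT: 0 "
)

# Line holding the command that prints the table (literal match, no regex).
start <- grep("collect.models(", log_lines, fixed = TRUE)[1]

# First prompt line after that command; assumed to follow the last table row.
prompt_after <- which(startsWith(log_lines, ">") &
                        seq_along(log_lines) > start)[1]

# Rows between the header line and that prompt line.
n_rows <- prompt_after - start - 2

tab <- read.table(text = log_lines, skip = start, nrows = n_rows,
                  header = TRUE, row.names = 1)
tab
```

If no prompt line follows the table, prompt_after is NA and the call fails, so a real script would want to check for that case.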

It looks strange that you have to read an R console log. In any case, you can use the fact that your table lines begin with a number and extract the interesting lines using something like ^[0-9]+. Then read.table, as shown by @kohske, does the rest.

ll <- readLines('c:/users/mmiller21/simple R programs/dummy.log')
idx <- which(grepl('^[0-9]+', ll))   ## table rows start with a number
idx <- c(min(idx) - 1, idx)          ## prepend the header line
read.table(text = ll[idx])
   model npar AICc DeltaAICc weight Deviance
13    P1   19   94      0.00   0.78        9
12    P2   21   94      2.64   0.20        9
10    P3   15   94      9.44   0.02        9
2     P4   11   94    619.26   0.00        9
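In a long log, other lines may also start with a digit, and the pattern above would pick them up as well. One possible refinement, sketched here under assumptions, is to locate the header by its column names and keep only the unbroken run of numeric-leading lines directly below it. The inline ll vector, including its last line, is made up for illustration, and at least one table row is assumed to exist.

```r
# Made-up stand-in for the log; the last line is unrelated output that
# happens to start with a digit.
ll <- c(
  "> collect.models(, adjust = FALSE)",
  "      model npar  AICc  DeltaAICc weight  Deviance",
  "13      P1   19    94      0.00     0.78      9",
  "12      P2   21    94      2.64     0.20      9",
  "> ",
  "ERROR COUNT: 0 ",
  "3 hypothetical lines of unrelated output"
)

# Header line, identified by its first two column names.
hdr <- grep("^[[:space:]]*model[[:space:]]+npar", ll)[1]

# Which lines look like table rows (a number followed by whitespace).
is_row <- grepl("^[0-9]+[[:space:]]", ll)

# Length of the unbroken run of row-like lines right after the header.
after_hdr <- is_row[-seq_len(hdr)]
n <- if (all(after_hdr)) length(after_hdr) else which(!after_hdr)[1] - 1

tab <- read.table(text = ll[hdr:(hdr + n)], header = TRUE, row.names = 1)
tab
```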

Thank you to those who posted answers. Because of the size, complexity and variability of the actual log files I think I need to continue to make use of the variables top and bottom. However, I used elements of dickoa's answer to come up with the following.

my.data <- readLines('c:/users/mmiller21/simple R programs/dummy.log')

top    <- '> collect.models\\(, adjust = FALSE)'
bottom <- '> # the three lines below count the number of errors in the code above'

my.data <- my.data[-c(grep(bottom, my.data):length(my.data))]
my.data <- my.data[-c(1:grep(top, my.data))]

x <- read.table(text=my.data, comment.char = ">")
x

#    model npar AICc DeltaAICc weight Deviance
# 13    P1   19   94      0.00   0.78        9
# 12    P2   21   94      2.64   0.20        9
# 10    P3   15   94      9.44   0.02        9
# 2     P4   11   94    619.26   0.00        9

Here is even simpler code:

my.data <- readLines('c:/users/mmiller21/simple R programs/dummy.log')

top    <- '> collect.models\\(, adjust = FALSE)'
bottom <- '> # the three lines below count the number of errors in the code above'

my.data  <- my.data[grep(top, my.data):grep(bottom, my.data)]

x <- read.table(text=my.data, comment.char = ">")
x
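For reuse across many logs, the same steps could be wrapped in a small function. This is a sketch, not a tested tool for the real logs: the name extract_table is hypothetical, the markers are matched literally with fixed = TRUE so that parentheses and dots need no escaping, and the inline log_lines vector stands in for readLines() output.

```r
# Hypothetical helper: read the table between two marker lines.
extract_table <- function(lines, top, bottom) {
  i_top <- grep(top, lines, fixed = TRUE)
  i_bot <- grep(bottom, lines, fixed = TRUE)
  if (length(i_top) == 0 || length(i_bot) == 0) {
    stop("marker line not found")
  }
  # Use the first top marker and the first bottom marker after it.
  i_top <- i_top[1]
  i_bot <- i_bot[i_bot > i_top][1]
  if (is.na(i_bot)) {
    stop("no bottom marker after the top marker")
  }
  # Prompt lines start with ">" and are dropped as comments.
  read.table(text = lines[i_top:i_bot], comment.char = ">",
             header = TRUE, row.names = 1)
}

# Shortened stand-in for the log file contents.
log_lines <- c(
  "> # Display AICc Table for the models above",
  "> collect.models(, adjust = FALSE)",
  "      model npar  AICc  DeltaAICc weight  Deviance",
  "13      P1   19    94      0.00     0.78      9",
  "12      P2   21    94      2.64     0.20      9",
  "10      P3   15    94      9.44     0.02      9",
  "2       P4   11    94    619.26     0.00      9",
  "> ",
  "> # the three lines below count the number of errors in the code above",
  "ERROR COUNT: 0 "
)

x <- extract_table(log_lines,
                   top    = "> collect.models(, adjust = FALSE)",
                   bottom = "> # the three lines below count")
x
```

Stopping with an error when a marker is missing is a design choice for this sketch; returning NULL might suit a batch job over many logs better.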
