
How to count matching blocks in a text file in Groovy

Say I have a text file that looks like this (including the filename matches header line):

filename    matches
bugs.txt    5
bugs.txt    3
bugs.txt    12
fish.txt    4
fish.txt    67
birds.txt    34

etc...

I want to make a new text file, each line of which represents a single filename with the following information: filename, number of times the filename appears, sum of matches

so the first three lines would read:

bugs.txt    3    20
fish.txt    2    71
birds.txt   1    34

The first line of the original text file (which contains the text filename \t matches) is making things hard for me. Any advice?

Here's my code that doesn't quite do the trick (off-by-one errors...):

h = null
instances = 0
matches = 0

f.eachLine { line ->

String[] data = line.split (/\t/)

if (line =~ /filename.*/) {}

else {
    source = data[0]  

    if ( source == h) {
        instances ++
        matches = matches + data[9]
    }
    else {
        println h + '\t' + instances + '\t' + matches
        instances = 0   
        matches = 0
        h = source
    }    
} 
}

Note: the indices for data[] correspond to the actual text file I'm using.

I came up with this (using dummy data):

// In reality, you can get this with:
// def text = new File( 'file.txt' ).text
def text = '''filename\tmatches
             |bugs.txt\t5
             |bugs.txt\t3
             |bugs.txt\t12
             |fish.txt\t4
             |fish.txt\t67
             |birds.txt\t34'''.stripMargin()

text.split( /\r\n|\n|\r/ ).                                     // split based on newline (try \r\n first so it is not split twice)
     drop(1)*.                                                  // drop the header line
     split( /\t/ ).                                             // then split each of these by tab
     collect { [ it[ 0 ], it[ 1 ] as int ] }.                   // convert the second element to int
     groupBy { it[ 0 ] }.                                       // group into a map by filename
     collect { k, v -> [ k, v.size(), v*.getAt( 1 ).sum() ] }*. // then make a list of file,nfiles,sum
     join( '\t' ).                                              // join each of these into a string separated by tab
     each {                                                     // then print them out
       println it
     }

Obviously though, this loads the whole file into memory in one go...

The main problems with your code are:

  • you're using data[9] when the matches are in column 1
  • you skip updating instances and matches when source != h, so the first line of each new filename is never counted
  • since you only println when the filename changes, you don't output the results for the last file
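
For comparison, the original loop can be kept and patched directly. The following is only a sketch under a few assumptions: the sample rows are inlined rather than read from a file, the count is taken from column 1 as in the sample data (the real file apparently uses a different index), and the names current, total and results are made up for illustration.

```groovy
// Sample data inlined so the sketch is self-contained (assumption:
// tab-separated, with the count in column 1 as in the question's example).
def text = '''filename\tmatches
bugs.txt\t5
bugs.txt\t3
bugs.txt\t12
fish.txt\t4
fish.txt\t67
birds.txt\t34'''

def results = []          // finished output lines, one per filename
String current = null     // filename of the block currently being accumulated
int instances = 0
int total = 0

text.eachLine { line ->
    if (line.startsWith('filename')) return    // skip the header row

    String[] data = line.split(/\t/)
    if (data[0] != current) {
        // A new block starts: emit the finished one, unless this is the first block
        if (current != null) {
            results << current + '\t' + instances + '\t' + total
        }
        current = data[0]
        instances = 0
        total = 0
    }
    // Count the current line too, including the first line of each block
    instances++
    total += (data[1] as int)
}

// Emit the last block, which no later filename change would trigger
if (current != null) {
    results << current + '\t' + instances + '\t' + total
}

results.each { println it }
```

Like the original, this relies on all lines for one filename being adjacent in the input.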

Here's a simpler implementation that accumulates the results in a map:

// this will store a map of filename -> list of matches
// e.g. ['bugs.txt': [5, 3, 12], ...]
def fileMatches = [:].withDefault{[]}

new File('file.txt').eachLine { line ->
    // skip the header line
    if (!(line =~ /filename.*/)) {
        def (source, matches) = line.split (/\t/)
        // append the number of matches to source's list
        fileMatches[source] << (matches as int)
    }
}
fileMatches.each { source, matches ->
    println "$source\t${matches.size()}\t${matches.sum()}"
}
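
Since the question asks for a new text file rather than console output, the final loop could write its lines through a writer instead of println. This is a rough sketch: the map literal stands in for the fileMatches map built above, and the temp file is a hypothetical output location chosen only to keep the example self-contained.

```groovy
// Stand-in for the map built by the eachLine loop (illustrative values
// taken from the question's sample data).
def fileMatches = ['bugs.txt': [5, 3, 12], 'fish.txt': [4, 67], 'birds.txt': [34]]

// Hypothetical output file; replace with the real destination path
def outFile = File.createTempFile('summary', '.txt')

// withWriter closes the writer when the closure finishes
outFile.withWriter { writer ->
    fileMatches.each { source, matches ->
        // one line per filename: name, number of occurrences, sum of matches
        writer.writeLine("$source\t${matches.size()}\t${matches.sum()}")
    }
}
```

Only the per-filename lists are held in memory here, not the whole input file.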

