How to count matching blocks in a text file in Groovy

Question

Say I have a text file that looks like this (including the filename matches header line):

filename    matches
bugs.txt    5
bugs.txt    3
bugs.txt    12
fish.txt    4
fish.txt    67
birds.txt    34

and so on.

I want to make a new text file in which each line represents a single filename and contains the following information: the filename, the number of times the filename appears, and the sum of its matches.

So the first three lines would read:

bugs.txt    3    20
fish.txt    2    71
birds.txt   1    34

The first line of the original text file (which contains the header text filename \t matches) is making things hard for me. Any advice?

Here's my code, which doesn't quite do the trick (off-by-one errors...):

h = null
instances = 0
matches = 0

f.eachLine { line ->

String[] data = line.split (/\t/)

if (line =~ /filename.*/) {}

else {
    source = data[0]  

    if ( source == h) {
        instances ++
        matches = matches + data[9]
    }
    else {
        println h + '\t' + instances + '\t' + matches
        instances = 0   
        matches = 0
        h = source
    }    
} 
}

Note: the indices into data[] correspond to the actual text file I'm using.

Answer 1

I came up with this (using dummy data):

// In reality, you can get this with:
// def text = new File( 'file.txt' ).text
def text = '''filename\tmatches
             |bugs.txt\t5
             |bugs.txt\t3
             |bugs.txt\t12
             |fish.txt\t4
             |fish.txt\t67
             |birds.txt\t34'''.stripMargin()

text.split( /\r\n|\n|\r/ ).                                     // split on any newline style (CRLF tried first, so it is not split twice)
     drop(1)*.                                                  // drop the header line
     split( /\t/ ).                                             // then split each of these by tab
     collect { [ it[ 0 ], it[ 1 ] as int ] }.                   // convert the second element to int
     groupBy { it[ 0 ] }.                                       // group into a map by filename
     collect { k, v -> [ k, v.size(), v*.getAt( 1 ).sum() ] }*. // then make a list of file,nfiles,sum
     join( '\t' ).                                              // join each of these into a string separated by tab
     each {                                                     // then print them out
       println it
     }

Obviously, though, this loads the whole file into memory in one go...
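
If memory is a concern, one option is to stream the input and keep only a running count and sum per filename. This is a minimal sketch, assuming tab-separated columns as in the sample. summarize is a made-up helper name, and a StringReader stands in for the real file:

```groovy
// Sketch: read line by line, keeping one [count, sum] pair per filename
// instead of every parsed row. summarize() is a hypothetical helper.
def summarize(Reader reader) {
    def totals = [:]                               // LinkedHashMap, keeps first-seen order
    reader.eachLine { line, lineNo ->              // lineNo starts at 1
        if (lineNo == 1 || !line.trim()) return    // skip the header and blank lines
        def (name, matches) = line.split(/\t/)
        def entry = totals.get(name, [0, 0])       // stores [0, 0] the first time a name is seen
        entry[0] += 1                              // occurrences of this filename
        entry[1] += matches as int                 // running sum of matches
    }
    totals.collect { name, e -> [name, e[0], e[1]].join('\t') }
}

def sample = 'filename\tmatches\nbugs.txt\t5\nbugs.txt\t3\nbugs.txt\t12\n' +
             'fish.txt\t4\nfish.txt\t67\nbirds.txt\t34\n'
summarize(new StringReader(sample)).each { println it }
```

For a real file, something like new File('file.txt').withReader { summarize(it) } should behave the same way.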

Answer 2

The main problems with your code are:

you're using data[9] when the matches are in column 1
you skip updating instances and matches when source != h, so the first row of each new filename is never counted
since you only println when the filename changes, you never output the results for the last file
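
For comparison, the sequential approach from the question could be kept with those three points addressed. This is only a sketch: it assumes rows for the same filename are adjacent and that the matches sit in column 1 as in the sample (adjust the index for the real file). summarizeRuns is a made-up name.

```groovy
// Sketch of the question's loop with the three fixes applied.
// summarizeRuns() is a hypothetical helper; it expects the header as the first line.
def summarizeRuns(List<String> lines) {
    def out = []
    String current = null
    int instances = 0
    int matches = 0
    lines.drop(1).each { line ->               // drop(1) skips the header row
        def data = line.split(/\t/)
        if (data[0] != current) {
            // filename changed: flush the previous group, if there was one
            if (current != null) out << [current, instances, matches].join('\t')
            current = data[0]
            instances = 0
            matches = 0
        }
        instances++                            // count every row, including the first of a group
        matches += data[1] as int              // convert to int so the values are added numerically
    }
    // flush the final group, which no filename change would trigger
    if (current != null) out << [current, instances, matches].join('\t')
    out
}

def rows = ['filename\tmatches', 'bugs.txt\t5', 'bugs.txt\t3', 'bugs.txt\t12',
            'fish.txt\t4', 'fish.txt\t67', 'birds.txt\t34']
summarizeRuns(rows).each { println it }
```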

Here's a simpler implementation that accumulates the results in a map:

// this will store a map of filename -> list of matches
// e.g. ['bugs.txt': [5, 3, 12], ...]
def fileMatches = [:].withDefault{[]}

new File('file.txt').eachLine { line ->
    // skip the header line
    if (!(line =~ /filename.*/)) {
        def (source, matches) = line.split (/\t/)
        // append the number of matches to this source's list
        fileMatches[source] << (matches as int)
    }
}
fileMatches.each { source, matches ->
    println "$source\t${matches.size()}\t${matches.sum()}"
}
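
Since the question asks for a new text file rather than console output, the same accumulation could be written out with withWriter. A sketch only: the temp files are placeholders so the example runs on its own, and skipping line 1 as the header is an assumption about the input.

```groovy
// Sketch: accumulate per-filename matches, then write the summary to a new file.
// The temp files below are placeholders for the real input and output paths.
def input = File.createTempFile('matches', '.txt')
input.text = 'filename\tmatches\nbugs.txt\t5\nbugs.txt\t3\nbugs.txt\t12\n' +
             'fish.txt\t4\nfish.txt\t67\nbirds.txt\t34\n'
def output = File.createTempFile('summary', '.txt')

def fileMatches = [:].withDefault { [] }
input.eachLine { line, lineNo ->               // lineNo starts at 1
    if (lineNo > 1 && line.trim()) {           // skip the header and blank lines
        def (source, matches) = line.split(/\t/)
        fileMatches[source] << (matches as int)
    }
}

output.withWriter { writer ->                  // the writer is closed automatically
    fileMatches.each { source, matches ->
        writer.writeLine([source, matches.size(), matches.sum()].join('\t'))
    }
}
print output.text
```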

Answer 1: 2, 2011-12-01 21:44:37
Answer 2: 2, accepted, 2011-12-01 21:58:49
