How to count matching blocks in a text file in Groovy
Say I have a tab-separated text file that looks like this (including the "filename matches" header line):
filename matches
bugs.txt 5
bugs.txt 3
bugs.txt 12
fish.txt 4
fish.txt 67
birds.txt 34
etc...
I want to make a new text file in which each line represents a single filename with the following information: filename, number of times the filename appears, sum of matches.
So the first three lines would read:
bugs.txt 3 20
fish.txt 2 71
birds.txt 1 34
The first line of the original text file (which contains the text "filename \t matches") is making things hard for me. Any advice?
Here's my code, which doesn't quite do the trick (off-by-one errors...):
h = null
instances = 0
matches = 0
f.eachLine { line ->
    String[] data = line.split (/\t/)
    if (line =~ /filename.*/) {}
    else {
        source = data[0]
        if ( source == h) {
            instances ++
            matches = matches + data[9]
        }
        else {
            println h + '\t' + instances + '\t' + matches
            instances = 0
            matches = 0
            h = source
        }
    }
}
Note: the indices for data[] correspond to the actual text file I'm using.
I came up with this (using dummy data):
// In reality, you can get this with:
//   def text = new File( 'file.txt' ).text
def text = '''filename\tmatches
             |bugs.txt\t5
             |bugs.txt\t3
             |bugs.txt\t12
             |fish.txt\t4
             |fish.txt\t67
             |birds.txt\t34'''.stripMargin()

text.split( /\r\n|\n|\r/ ).                 // split on newlines (CRLF first, so it counts as one break)
     drop( 1 )*.                            // drop the header line
     split( /\t/ ).                         // then split each of these by tab
     collect { [ it[ 0 ], it[ 1 ] as int ] }. // convert the second element to int
     groupBy { it[ 0 ] }.                   // group into a map by filename
     collect { k, v -> [ k, v.size(), v*.getAt( 1 ).sum() ] }*. // then make a list of [file, count, sum]
     join( '\t' ).                          // join each of these into a string separated by tab
     each {                                 // then print them out
       println it
     }
Obviously though, this loads the whole file into memory in one go...
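If the memory use matters, one possible workaround is to read the file line by line and keep only a running count and sum per filename. This is just a sketch: summarise is a made-up helper name, and the temp file stands in for your real file so that the snippet runs on its own.

```groovy
// Sketch: stream the file and keep only a running [count, sum] per filename.
// summarise is a hypothetical helper name, not part of Groovy.
def summarise(File input) {
    def totals = [:]                          // LinkedHashMap, so first-seen order is kept
    input.eachLine { line, lineNumber ->
        // eachLine numbers lines from 1, so line 1 is the header; also skip blank lines
        if (lineNumber == 1 || !line.trim()) return
        def fields = line.split(/\t/)
        def entry = totals.get(fields[0], [0, 0])   // stores the default when the key is absent
        entry[0] += 1                         // number of rows for this filename
        entry[1] += fields[1] as int          // running sum of matches
    }
    totals.collect { name, entry -> "$name\t${entry[0]}\t${entry[1]}".toString() }
}

// Dummy data written to a temp file (assumption: your real file has the same two columns)
def sampleFile = File.createTempFile('matches', '.txt')
sampleFile.deleteOnExit()
sampleFile.text = 'filename\tmatches\nbugs.txt\t5\nbugs.txt\t3\nbugs.txt\t12\n' +
                  'fish.txt\t4\nfish.txt\t67\nbirds.txt\t34\n'

def summary = summarise(sampleFile)
summary.each { println it }
```

Memory then grows with the number of distinct filenames rather than with the number of lines.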
The main problems with your code are:
You're using data[9] when the matches are in column 1.
You only update instances and matches when source == h, so the first line of each new filename is never counted.
Because you only println when the filename changes, you don't output the results for the last file.
Here's a simpler implementation that accumulates the results in a map:
// this will store a map of filename -> list of matches
// e.g. ['bugs.txt': [5, 3, 12], ...]
def fileMatches = [:].withDefault{[]}

new File('file.txt').eachLine { line ->

    // skip the header line
    if (!(line =~ /filename.*/)) {

        def (source, matches) = line.split (/\t/)
        // append the number of matches to the source's list
        fileMatches[source] << (matches as int)
    }
}

fileMatches.each { source, matches ->
    println "$source\t${matches.size()}\t${matches.sum()}"
}