Joining two files with regular expression in Unix (ideally with perl)
I have the following two files, disconnect.txt and answered.txt:
disconnect.txt
2011-07-08 00:59:06,363 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:459 - AnalyzedInfo had ActCode = Disconnected from: 40397400012 to:40397400032
2011-07-08 00:59:06,363 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:459 - AnalyzedInfo had ActCode = Disconnected from: 4035350012 to:40677400032
answered.txt
2011-07-08 00:59:40,706 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2301986 from: 40397643433 to:403###34**
2011-07-08 00:59:40,706 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2301986 from: 3455334459 to:1222
2011-07-08 00:59:48,893 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2220158 from: 4035350012 to:40677400032
I would like to create a join on these files based on the from: and to: fields, and the output should be the matching lines from answered.txt. For example, for the two files above, the output would be:
2011-07-08 00:59:48,893 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2220158 from: 4035350012 to:40677400032
I'm currently doing it by comparing each line in file 1 with each line in file 2, but want to know if a more efficient way exists (these files will be tens of gigabytes each).
Thank you
Sounds like you have hundreds of millions of lines?
Unless the files are sorted in such a way that you can expect the order of the from: and to: fields to at least vaguely correlate, this is a job for a database.
If the files are large, the quadratic algorithm will take a lifetime.
Here is a Ruby script that uses just a single hash-table lookup per line of answered.txt:
def key(s)
  s.split('from:')[1].split('to:').map(&:strip).join('.')
end

h = {}
open 'disconnect.txt', 'r' do |f|
  while s = f.gets
    h[key(s)] = true
  end
end

open 'answered.txt', 'r' do |f|
  while a = f.gets
    puts a if h[key(a)]
  end
end
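As a quick sanity check on the key helper above (a standalone snippet, not part of the answer's script), here is what it extracts from one of the sample log lines:

```ruby
def key(s)
  # Take everything after 'from:', split it at 'to:', trim whitespace,
  # and join the two numbers into a single hash key.
  s.split('from:')[1].split('to:').map(&:strip).join('.')
end

line = '2011-07-08 00:59:48,893 [socketProcessor] DEBUG ' \
       'ProbeEventDetectorIS41Impl:404 - Normal Call Answered, ' \
       'billingid=2220158 from: 4035350012 to:40677400032'

puts key(line)   # prints 4035350012.40677400032
```

Both files are keyed the same way, so a disconnect line and an answered line with the same from/to pair collide in the hash and the answered line is printed.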
Like ysth says, it all depends on the number of lines in disconnect.txt. If that's a really big¹ number, then you will probably not be able to fit all the keys in memory and you will need a database.

1. The number of lines in disconnect.txt multiplied by (roughly) 64 should be less than the amount of memory in your machine.
First, sort the files on the from/to fields if they are not already sorted that way.
Then take the sorted files and compare the first lines of each.
This is the fastest way to compare two (or more) sorted files and it guarantees that no line will be read from disk more than once.
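A minimal sketch of that merge, assuming both inputs have already been reduced to one sorted join key per line (a simplification of the full log format, with hypothetical file paths):

```ruby
# Merge-join two files that are already sorted on their join key.
# Each file is assumed to hold one key per line; for the logs above you
# would first extract the from/to pair as the key, as in the Ruby answer.
def merge_join(path_a, path_b)
  matches = []
  File.open(path_a) do |a|
    File.open(path_b) do |b|
      ka, kb = a.gets&.chomp, b.gets&.chomp
      while ka && kb
        case ka <=> kb
        when -1 then ka = a.gets&.chomp   # a is behind: advance a
        when 1  then kb = b.gets&.chomp   # b is behind: advance b
        else                              # equal keys: emit the match
          matches << kb
          # Simplification: runs of duplicate keys would need extra handling.
          ka, kb = a.gets&.chomp, b.gets&.chomp
        end
      end
    end
  end
  matches
end
```

Each file is read sequentially exactly once, so the whole join is O(n) after the initial sort and uses constant memory.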
If your files aren't appropriately sorted, then the initial sorting operation may be somewhat expensive on files in the "tens of gigabytes each" size range, but:
Or you could just use a database as mentioned in previous answers. The above method will be more efficient in most, if not all, cases, but a database-based solution would be easier to write and would also provide a lot of flexibility for analyzing your data in other ways without needing to do a complete scan through each file every time you need to access anything in it.