
Joining two files with regular expression in Unix (ideally with perl)

I have the following two files, disconnect.txt and answered.txt:

disconnect.txt

2011-07-08 00:59:06,363 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:459 - AnalyzedInfo had ActCode = Disconnected from: 40397400012 to:40397400032
2011-07-08 00:59:06,363 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:459 - AnalyzedInfo had ActCode = Disconnected from: 4035350012 to:40677400032

answered.txt

2011-07-08 00:59:40,706 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2301986 from: 40397643433 to:403###34**
2011-07-08 00:59:40,706 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2301986 from: 3455334459 to:1222
2011-07-08 00:59:48,893 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2220158 from: 4035350012 to:40677400032

I would like to create a join on these files based on the from: and to: fields, and the output should be the matching lines from answered.txt. For example, for the above two files the output would be:

2011-07-08 00:59:48,893 [socketProcessor] DEBUG ProbeEventDetectorIS41Impl:404 - Normal Call Answered, billingid=2220158 from: 4035350012 to:40677400032

I'm currently doing it by comparing each line in file 1 with each line in file 2, but I want to know if a more efficient way exists (these files will be tens of gigabytes each).

Thank you

Sounds like you have hundreds of millions of lines?

Unless the files are sorted in such a way that you can expect the order of the from: and to: fields to at least vaguely correlate, this is a job for a database.
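For a concrete starting point, here is a minimal sketch of the database route using SQLite through Perl's DBI (this assumes DBD::SQLite is installed; the table layout and the from:/to: extraction regex are my assumptions, not part of the original answer):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Extract the from/to pair as a single join key, e.g. "4035350012|40677400032".
# The regex is a guess at the field format shown in the sample lines.
sub key {
    my ($line) = @_;
    return unless $line =~ /from:\s*(\S+)\s+to:\s*(\S+)/;
    return "$1|$2";
}

my $dbh = DBI->connect('dbi:SQLite:dbname=calls.db', '', '', { RaiseError => 1 });
$dbh->do('CREATE TABLE IF NOT EXISTS disconnected (k TEXT PRIMARY KEY)');
$dbh->do('CREATE TABLE IF NOT EXISTS answered (k TEXT, line TEXT)');

# Bulk-load both files inside one transaction.
$dbh->begin_work;
my $ins_d = $dbh->prepare('INSERT OR IGNORE INTO disconnected (k) VALUES (?)');
open my $dfh, '<', 'disconnect.txt' or die $!;
while (my $line = <$dfh>) {
    my $k = key($line) or next;
    $ins_d->execute($k);
}
close $dfh;

my $ins_a = $dbh->prepare('INSERT INTO answered (k, line) VALUES (?, ?)');
open my $afh, '<', 'answered.txt' or die $!;
while (my $line = <$afh>) {
    my $k = key($line) or next;
    $ins_a->execute($k, $line);
}
close $afh;
$dbh->commit;

# The join itself: answered lines whose from/to pair also appears in disconnect.txt.
my $sth = $dbh->prepare(
    'SELECT a.line FROM answered a JOIN disconnected d ON a.k = d.k');
$sth->execute;
while (my ($line) = $sth->fetchrow_array) {
    print $line;
}

The point of going through a database here is that the working set lives on disk rather than in memory, so the file sizes matter much less than with an in-memory hash.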

If the files are large, the quadratic algorithm will take a lifetime.

Here is a Ruby script that uses just a single hash table lookup per line in answered.txt:

# Build the join key from the from:/to: pair of a log line.
def key s
  s.split('from:')[1].split('to:').map(&:strip).join('.')
end

# One pass over disconnect.txt: remember every from/to pair seen.
h = {}
open 'disconnect.txt', 'r' do |f|
  while s = f.gets
    h[key(s)] = true
  end
end

# One pass over answered.txt: print lines whose pair was also disconnected.
open 'answered.txt', 'r' do |f|
  while a = f.gets
    puts a if h[key(a)]
  end
end
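Since the question asks for Perl, the same single-pass hash join looks roughly like this (the from:/to: regex is my reading of the sample lines, not something taken from the answer):

#!/usr/bin/perl
use strict;
use warnings;

# Same key as the Ruby version: the from/to pair of a log line.
sub key {
    my ($line) = @_;
    return unless $line =~ /from:\s*(\S+)\s+to:\s*(\S+)/;
    return "$1.$2";
}

# Pass 1: remember every from/to pair that disconnected.
my %seen;
open my $dfh, '<', 'disconnect.txt' or die "disconnect.txt: $!";
while (my $line = <$dfh>) {
    my $k = key($line) or next;
    $seen{$k} = 1;
}
close $dfh;

# Pass 2: one hash lookup per answered line.
open my $afh, '<', 'answered.txt' or die "answered.txt: $!";
while (my $line = <$afh>) {
    my $k = key($line) or next;
    print $line if $seen{$k};
}
close $afh;

Memory use is the same as the Ruby version: one hash entry per line of disconnect.txt.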

Like ysth says, it all depends on the number of lines in disconnect.txt. If that's a really big number¹, then you will probably not be able to fit all the keys in memory and you will need a database.


1. The number of lines in disconnect.txt multiplied by (roughly) 64 should be less than the amount of memory in your machine. For example, 100 million distinct keys would need on the order of 6.4 GB.

First, sort the files on the from/to timestamps if they are not already sorted that way. (Yes, I know the from/to appear to be stored as epoch seconds, but that's still a timestamp.)

Then take the sorted files and compare the first lines of each.

  • If the timestamps are the same, you have a match. Hooray. Advance a line in one or both files (depending on your rules for duplicate timestamps in each) and compare again.
  • If not, grab the next line in whichever file has the earlier timestamp and compare again.

This is the fastest way to compare two (or more) sorted files and it guarantees that no line will be read from disk more than once.
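In Perl, the merge step might look roughly like this, assuming both files have already been sorted on the extracted from/to key (the file names and the key regex below are placeholders, not from the original answer):

#!/usr/bin/perl
use strict;
use warnings;

# Both inputs must already be sorted on this key for the merge to work.
sub key {
    my ($line) = @_;
    return unless $line =~ /from:\s*(\S+)\s+to:\s*(\S+)/;
    return "$1|$2";
}

open my $dfh, '<', 'disconnect.sorted' or die $!;
open my $afh, '<', 'answered.sorted'   or die $!;

my $d      = <$dfh>;
my $a_line = <$afh>;

# Classic merge join: each line of each file is read from disk exactly once.
while (defined $d and defined $a_line) {
    my $dk = key($d);
    my $ak = key($a_line);

    if (!defined $dk) { $d = <$dfh>; next }        # skip unparsable lines
    if (!defined $ak) { $a_line = <$afh>; next }

    if ($dk eq $ak) {
        print $a_line;          # match: emit the answered.txt line
        $a_line = <$afh>;       # advance (adjust for your duplicate-key rules)
    } elsif ($dk lt $ak) {
        $d = <$dfh>;            # disconnect key sorts earlier, advance it
    } else {
        $a_line = <$afh>;       # answered key sorts earlier, advance it
    }
}

The pre-sorting could be done with whatever external sort you prefer; the point is that the comparison pass itself is a single linear scan over each file.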

If your files aren't appropriately sorted, then the initial sorting operation may be somewhat expensive on files in the "tens of gigabytes each" size range, but:

  1. You can split the files into arbitrarily-sized chunks (ideally small enough for each chunk to fit into memory), sort each chunk independently, and then generalize the above algorithm from two files to as many as are necessary.
  2. Even if you don't do that and you deal with the disk thrashing involved with sorting files larger than the available memory, sorting and then doing a single pass over each file will still be a lot faster than any solution involving a cartesian join.

Or you could just use a database as mentioned in previous answers. The above method will be more efficient in most, if not all, cases, but a database-based solution would be easier to write and would also provide a lot of flexibility for analyzing your data in other ways without needing to do a complete scan through each file every time you need to access anything in it.
