
Sort multi-line blocks in a large (~10GB) file by a single token in the block

I have a large file (~10GB) full of memory traces in this format:

INPUT:

Address: 7f2da282c000
Data:
0x7f2da282c000
0
0x7f2db4c810d0
0
0x304b2e198
0x304b2e1b8
0x304b3af38
0x304b54528

Address: 603000
Data:
0x603000
0
0x7f2db4c810d0
0
0x304b2e198
0x304b2e1b8
0x304b3af38
0x304b54528

.
.
.

Address: 7f2da2a38dc0
Data:
0x7f2da2a38dc0
0
0x7f2db4c810d0
0
0x304b2e198
0x304b2e1b8
0x304b3af38
0x304b54528

These are addresses and the 64 bytes of data at each address, captured at different points in time as the accesses occurred. Each hex value in the data field represents 8 bytes. Suppose each address and its data make up one multi-line block.

Certain addresses are accessed/updated multiple times, and I'd like to sort the multi-line blocks so that each address with multiple updates has those accesses right below it, like this:

OUTPUT:

Address: 7f2da2a38dc0
Data:
0x7f2da2a38dc0
0
0x7f2db4c810d0
0
0x304b2e198
0x304b2e1b8
0x304b3af38
0x304b54528

0x304b2e198
0x304b2e1b8
0x304b3af38
0x304b54528
0x304b2e198
0x304b2e1b8
0x304b3af38
0x304b54528

. 
.
.

0x7f2da2a38dc0
0
0x7f2db4c810d0
0
0x7f2da2a38dc0
0
0x7f2db4c810d0
0

Address: 0xadsf212
Data: 
[Updates]

[Updates]
. 
. 
.
[Updates]

Here each address that is accessed more than once has its respective updates below it, and addresses that are accessed only once are thrown out.

What I tried:

- Comparing each address to every other address in a simple C++ program, but it's way too slow (it has been running for a couple of days now).

- Used *nix sort to get all the addresses and their counts (sort -k 2,2 bigTextFile.txt | uniq -cd > outputFile), but that only sorts on the first line of each multi-line block, the deadbeeff part in 'Address: deadbeeff', and leaves the data blocks behind. Is there any way for sort to take a set of lines and sort them by a single value in the top line of the block, i.e. the address value, moving the entire block around? I found some awk scripts, but they looked inapplicable. (A flatten/sort/unflatten sketch appears below, after this list.)

- Looked into making a database out of the file, with the address, the access index, and the data as three columns, and then running a query for all the data updates that share an address, but I've never used databases and I'm not sure this is the best approach. (A rough sketch of this is also below.)
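
On the sort question above: sort only compares single lines, but the blocks can be flattened onto single lines first, sorted stably on the address, and unflattened afterwards. A minimal sketch, assuming the trace is in a file named trace.txt and that tab and the byte \x01 never occur in the data:

perl -00 -ne 'chomp; next unless /^Address: (\S+)/; my $a = $1; s/\n/\x01/g; print "$a\t$_\n"' trace.txt \
  | sort -s -k 1,1 \
  | perl -ne 'chomp; s/^\S+\t//; s/\x01/\n/g; print "$_\n\n"'

The first stage reads one blank-line-separated block at a time (-00 is paragraph mode) and joins it onto one tab-prefixed line keyed by the address; sort -s -k 1,1 then groups by address while the stable flag keeps the access order; the last stage strips the key and restores the newlines. This only brings same-address blocks next to each other; merging each group under one Address:/Data: header and dropping single-access addresses would still need one more streaming pass.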
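
And on the database idea: a hypothetical sketch of what it could look like with SQLite through Perl's DBI (the table layout, the trace.db file name, and reading the trace from stdin/arguments are all assumptions):

use strict;
use warnings;
use DBI;  # needs the DBD::SQLite driver installed

my $dbh = DBI->connect('dbi:SQLite:dbname=trace.db', '', '',
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do('CREATE TABLE access (addr TEXT, idx INTEGER, data TEXT)');
my $ins = $dbh->prepare('INSERT INTO access VALUES (?, ?, ?)');

$/ = "";                            # One blank-line-separated block per read.
my $idx = 0;
while (my $block = <>) {
  next unless $block =~ /^Address: (\S+)\nData:\n(.*)/s;
  my ($addr, $data) = ($1, $2);
  $data =~ s/\n+\z/\n/;             # Normalize trailing newlines.
  $ins->execute($addr, $idx++, $data);
}
$dbh->commit;

# Every update whose address occurs more than once, in access order.
my $rows = $dbh->selectall_arrayref(q{
  SELECT addr, data FROM access
  WHERE addr IN (SELECT addr FROM access GROUP BY addr HAVING COUNT(*) > 1)
  ORDER BY addr, idx});
my $last = '';
for my $row (@$rows) {
  print "Address: $row->[0]\nData:\n" if $row->[0] ne $last;
  print "$row->[1]\n";              # Blank line between updates.
  $last = $row->[0];
}

SQLite keeps the table on disk, so this stays within modest RAM, at the cost of insert time for a 10GB trace.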

Any recommendations on what I tried, or new approaches, are appreciated.

This is pretty basic file processing. It sounds like you just need to hash the blocks by address and then print the map values that have more than one block. In a language like Perl this is simple:

use strict;
use warnings;

# Read one block: the address line followed by its data lines.
# Returns an array ref, which is empty at EOF.
sub read_block {
  my @data;
  while (<>) {
    s/^Address: //;                # Remove the "Address: " prefix.
    return \@data unless /\S/;     # A blank line ends the block.
    push @data, $_ unless /^Data/; # Ignore "Data:".
  }
  \@data                           # EOF also ends the block.
}

sub main {
  my %map;
  while (1) {
    my $block = read_block;
    last unless scalar(@$block) > 0;
    my $addr = shift @$block;      # First line of the block is the address.
    chomp $addr;
    push @{$map{$addr}}, $block;   # Group blocks by address.
  }
  # Just for fun, sort the keys numerically by address.
  my @sorted_addr = sort { hex($a) <=> hex($b) } keys %map;
  # Print only addresses that have more than one access.
  foreach my $addr (@sorted_addr) {
    next unless scalar(@{$map{$addr}}) > 1; # Ignore addresses seen once.
    print "Address: $addr\nData:\n";
    foreach my $block (@{$map{$addr}}) {
      print @$block;
      print "\n";                  # Leave a blank line between updates.
    }
  }
}

main;
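
Since the script reads from the <> filehandle, it takes the trace file as an argument or on stdin; assuming it is saved as group_accesses.pl (name made up here):

perl group_accesses.pl trace.txt > grouped.txt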

Of course you'll need a machine with enough RAM to hold the data. 32 GB ought to do nicely. If you don't have that, a trickier two-pass algorithm will do with much less.
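
The two-pass variant isn't spelled out above; one possible shape of it, sketched under the assumption that the trace is a seekable file named trace.txt, records byte offsets per address in pass 1 and seeks back in pass 2, so RAM only ever holds addresses and offsets:

use strict;
use warnings;

$/ = "";                                 # Paragraph mode: one block per read.
my %offsets;                             # address => [byte offsets of its blocks]
open my $fh, '<', 'trace.txt' or die "trace.txt: $!";

# Pass 1: remember where every block starts, keyed by its address.
my $pos = tell $fh;
while (my $block = <$fh>) {
  push @{$offsets{$1}}, $pos if $block =~ /^Address: (\S+)/;
  $pos = tell $fh;
}

# Pass 2: for each multiply-accessed address, seek back to its
# blocks and stream them out in access order.
for my $addr (sort { hex($a) <=> hex($b) } keys %offsets) {
  next unless @{$offsets{$addr}} > 1;    # Throw out single accesses.
  for my $off (@{$offsets{$addr}}) {
    seek $fh, $off, 0 or die "seek: $!";
    my $block = <$fh>;
    chomp $block;                        # Strip trailing newlines.
    print "$block\n\n";
  }
}
close $fh;

For brevity this reprints each block verbatim, with its own Address:/Data: lines, rather than merging the updates under a single header, and the random seeks will be slow on a spinning disk.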
