Sort multi-line blocks in a large (~10GB) file by a single token in the block

I have a large file (~10GB) full of memory traces in this format:

INPUT:

Address: 7f2da282c000
Data:
0x7f2da282c000
0
0x7f2db4c810d0
0
0x304b2e198
0x304b2e1b8
0x304b3af38
0x304b54528

Address: 603000
Data:
0x603000
0
0x7f2db4c810d0
0
0x304b2e198
0x304b2e1b8
0x304b3af38
0x304b54528

.
.
.

Address: 7f2da2a38dc0
Data:
0x7f2da2a38dc0
0
0x7f2db4c810d0
0
0x304b2e198
0x304b2e1b8
0x304b3af38
0x304b54528

These are addresses, and the 64 bytes of data at those addresses, captured at different points in time as the accesses occurred. Each hex value in the data field represents 8 bytes. Suppose each address and its data make up one multi-line block.

Certain addresses are accessed/updated multiple times, and I'd like to sort the multi-line blocks so that each address with multiple updates has those accesses right below it, like this:

OUTPUT:

Address: 7f2da2a38dc0
Data:
0x7f2da2a38dc0
0
0x7f2db4c810d0
0
0x304b2e198
0x304b2e1b8
0x304b3af38
0x304b54528

0x304b2e198
0x304b2e1b8
0x304b3af38
0x304b54528
0x304b2e198
0x304b2e1b8
0x304b3af38
0x304b54528

. 
.
.

0x7f2da2a38dc0
0
0x7f2db4c810d0
0
0x7f2da2a38dc0
0
0x7f2db4c810d0
0

Address: 0xadsf212
Data: 
[Updates]

[Updates]
. 
. 
.
[Updates]

Where each address that is accessed more than once has its respective updates below it, and addresses that are accessed only once are thrown out.

What I tried:

- Comparing each address to every other address in a simple C++ program, but it's way too slow (it has been running for a couple of days now).

- Used *nix sort to get all the addresses and their counts (sort -k 2,2 bigTextFile.txt | uniq -cd > output_file), but it only sorts by the first line of each multi-line block (the deadbeeff part of 'Address: deadbeeff'), and the data blocks get left behind. Is there any way for sort to take a set of lines and sort them by a single value in the top line of the block, i.e. the address, moving the entire block around? I found some awk scripts, but they didn't look applicable. (One way to make sort block-aware is sketched after this list.)

- Looked into making a database out of the file, with the address, an access index, and the data as three columns, and then running a query for all the data updates that have the same address, but I've never used databases and I'm not sure this is the best approach. (A rough sketch of such a query follows below.)
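On the second attempt: sort itself is line-oriented, but it can be made block-aware by flattening each blank-line-separated block onto a single line, sorting, and unflattening. A minimal sketch, not from the original post, assuming GNU awk/sort and that the control byte \001 never occurs in the trace (it won't in a plain ASCII hex dump):

# Hypothetical sketch: flatten blocks to single lines, sort, restore.
# awk's paragraph mode (RS=) reads one blank-line-separated block per
# record; turning its newlines into \001 makes each block one line.
# Sorting on field 1 with \001 as the delimiter keys on the
# "Address: <addr>" part only; -s keeps equal addresses in their
# original temporal order.
awk -v RS= '{ gsub("\n", "\001"); print }' bigTextFile.txt \
  | sort -t "$(printf '\001')" -k1,1 -s \
  | awk '{ gsub("\001", "\n"); print; print "" }'

Note this only groups same-address blocks next to each other; it does not drop the single-access addresses, which still needs a counting pass like the ones discussed below.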

Any recommendations on what I tried, or new approaches, are appreciated.
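And on the database idea: for what it's worth, this is roughly the query it would reduce to. A hypothetical sketch only; the table and column names are invented here, and it assumes a recent sqlite3 and that the trace has already been flattened into a one-row-per-access CSV (flattened.csv, also hypothetical) that .import can load:

# Hypothetical sketch: load one row per access into SQLite, then pull
# back only the addresses that were accessed more than once.
sqlite3 trace.db <<'SQL'
CREATE TABLE access (addr TEXT, seq INTEGER, data TEXT);
.import --csv flattened.csv access
-- Keep repeated addresses only, in temporal order within each address.
SELECT addr, data
FROM access
WHERE addr IN (SELECT addr FROM access
               GROUP BY addr HAVING COUNT(*) > 1)
ORDER BY addr, seq;
SQL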

This is pretty basic file processing. It sounds like you just need to hash the blocks on address and then print the map values that have more than one block. In languages like Perl this is simple:

use strict;
use warnings;

# Read one block (the address line plus its data lines); a blank line
# or EOF ends the block. Returns an array ref whose first element is
# the address and whose remaining elements are the data lines.
sub read_block {
  my @data;
  while (<>) {
    s/^Address: //;                 # Strip the "Address: " prefix.
    return \@data unless /\S/;      # A blank line ends the block.
    push @data, $_ unless /^Data/;  # Skip the "Data:" header line.
  }
  \@data;                           # EOF ends the last block.
}

sub main {
  my %map;
  while (1) {
    my $block = read_block;
    last unless scalar(@$block) > 0;
    chomp(my $addr = shift @$block);  # First line is the address.
    push @{ $map{$addr} }, $block;    # Group blocks by address.
  }
  # Just for fun, sort the keys numerically by address.
  my @sorted_addr = sort { hex($a) <=> hex($b) } keys %map;
  # Print only the addresses that have more than one access.
  foreach my $addr (@sorted_addr) {
    next unless scalar(@{ $map{$addr} }) > 1;  # Ignore blocks of 1.
    print "Address: $addr\n";
    foreach my $block (@{ $map{$addr} }) {
      print @$block;
      print "\n";  # Leave a blank line between blocks.
    }
  }
}

main;

Of course you'll need a machine with enough RAM to hold the data. 32 GB ought to do nicely. If you don't have that, a trickier two-pass algorithm will do with much less.
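Here is a minimal sketch, not from the original answer, of one way such a two-pass scheme could look, again using awk's paragraph mode so that only one counter per unique address is held in memory rather than the data itself:

# Hypothetical two-pass filter: pass 1 counts each address, pass 2
# re-reads the file and prints only the blocks whose address appeared
# more than once. In paragraph mode $2 is the address token on the
# "Address: <addr>" line.
awk -v RS= -v ORS='\n\n' '
  NR == FNR { count[$2]++; next }   # pass 1: count each address
  count[$2] > 1                     # pass 2: keep repeated addresses
' bigTextFile.txt bigTextFile.txt

This filters out the single-access blocks but does not group the survivors; piping them through a flatten/sort/unflatten pipeline like the one sketched in the question then groups them, with memory use independent of the file size.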
