
Huge text file (6Gb) search and replace

I have a huge file (6Gb) with 74,000 articles in this format:

<text id="1">
bla bla bla bla.........
</text>
<text id="2">
bla bla bla bla.........
</text>
<text id="3">
bla bla bla bla.........
</text>
<text id="............ and so on until 74,000

Then I have another file with the title corresponding to each of the ids, like this:

1       title1
2       title2
3       title3
...
74000   title74000

I have to put the corresponding title on each of the ids in the first file, so I transformed the second file into this script:

sed -i 's/<text id="1">/<text id="1" title="title1">/' file1
sed -i 's/<text id="2">/<text id="2" title="title2">/' file1
sed -i 's/<text id="3">/<text id="3" title="title3">/' file1
...
sed -i 's/<text id="74000">/<text id="74000" title="title74000">/' file1

Notice I didn't put a g at the end of the sed command because it is not a global search: at the first match it changes the string and moves on to the next search. The script works, but due to the huge size of the file it takes 12 minutes per change, which would give me about two years to complete all the changes, while I need them ASAP. So my question is whether somebody knows how I can perform these changes in a faster way, maybe with some other utility, Python, Perl or anything else...

In GNU Awk version 4, you could try:

gawk4 -f a.awk file2 RS="^$" file1

where a.awk is:

NR==FNR {
   b["<text id=\""$1"\">"]=$2
   next
}

{
    n=split($0,a,/<text id=[^>]*>/,s)
    printf "%s%s",s[0],a[1]
    for (i=1; i<n; i++) {
        ind=index(s[i],">")
        printf "%s%s", substr(s[i],1,ind-1) " title=\""b[s[i]]"\">", a[i+1]
    }
    printf "%s",s[n]
}

Output:

<text id="1" title="title1">
  bla bla bla bla.........
</text>
<text id="2" title="title2">
  bla bla bla bla.........
</text>
<text id="3" title="title3">
  bla bla bla bla.........
</text>

Update

Just for fun, I tested some of the solutions here on a 3.9Mb xml file (80000 titles) and a 1.3Mb info file (also 80000 titles):

  • @HåkonHægland : 0.629s
  • @tangent : 0.645s
  • @Borodin : 0.718s
  • @glennjackman : 1.098s

(Scripts for generating the input files can be found here: http://pastebin.com/PpTPt0gk )

Update 2 更新2

To get more reliable timing results I took an average over 20 runs:

  • @EdMorton : 0.485s (GNU Awk version 4.1)
  • @EdMorton : 0.528s (GNU Awk version 3.1.8)
  • @HåkonHægland : 0.589s
  • @Borodin : 0.599s
  • @tangent : 0.626s
  • @glennjackman : 1.074s

I suggest you use something like this.

It reads a line from the titles file every time it comes across a <text> tag in the XML file, and inserts the title attribute into the tag.

It also checks that the IDs in the two files match, and prints a log output every 500 <text> elements so that you can see its progress.

Output is sent to a separate file. You shouldn't overwrite the input file: if something goes wrong you will have lost your original data.

This should be only fractionally slower than just copying the XML file.

use strict;
use warnings;

use IO::Handle;

STDOUT->autoflush;

open my $in_xml,    '<', 'input.xml'  or die "Failed to open XML file: $!";
open my $in_titles, '<', 'titles.txt' or die "Failed to open titles file: $!";
open my $out_xml,   '>', 'output.xml' or die "Failed to open output file: $!";

while (my $xml_line = <$in_xml>) {

  if ( $xml_line =~ /<text/ ) {

    my ($id1) = $xml_line =~ /id="(\d+)"/;
    unless (defined $id1) {
      chomp $xml_line;
      die sprintf qq{Error in input XML file at line %d: %s\n-}, $in_xml->input_line_number, $xml_line;
    }
    printf "Processing ID %d\n", $id1 unless $id1 % 500;

    my $title_line = <$in_titles>;
    my ($id2, $title) = $title_line =~ /^(\d+)\s+(.+)/;
    unless (defined $id2) {
      chomp $title_line;
      die sprintf qq{Error in input titles file at line %d: %s\n-}, $in_titles->input_line_number, $title_line;
    }

    unless ($id1 == $id2) {
      die sprintf "ID mismatch %d <=> %d\nXML file line %d\ntitles file line %d\n-",
          $id1, $id2, $in_xml->input_line_number, $in_titles->input_line_number
    }

    $xml_line =~ s/>/ title="$title">/;
  }

  print $out_xml $xml_line;
}

close $out_xml or die "Failed to close output file: $!";

Output:

<text id="1" title="title1">
bla bla bla bla.........
</text>
<text id="2" title="title2">
bla bla bla bla.........
</text>
<text id="3" title="title3">
bla bla bla bla.........
</text>
awk '
NR==FNR {
    id = $1
    sub(/^[^[:space:]]+[[:space:]]+/,"")
    map["<text id=\"" id "\">"] = "<text id=\"" id "\" title=\"" $0 "\">"
    next
}
$0 in map { $0 = map[$0] }
1
' file2 file1
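A quick way to try this on a tiny sample (the toy data below is made up, but has the same shape as the question's files):

```shell
# Toy versions of the two inputs (data made up)
printf '1 title1\n2 title2\n' > file2
printf '<text id="1">\nbla bla\n</text>\n<text id="2">\nbla bla\n</text>\n' > file1

# First pass (NR==FNR) builds the tag -> rewritten-tag map from file2;
# second pass replaces any line found in the map, in a single read of file1
awk '
NR==FNR {
    id = $1
    sub(/^[^[:space:]]+[[:space:]]+/,"")
    map["<text id=\"" id "\">"] = "<text id=\"" id "\" title=\"" $0 "\">"
    next
}
$0 in map { $0 = map[$0] }
1
' file2 file1 > file1.titled
cat file1.titled
```

Because the lookup is a single hash access per line, the run time stays roughly one pass over the big file no matter how many titles there are.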

If file2 is tab-separated it gets simpler and, I expect, faster:

awk -F'\t' '
NR==FNR {
    map["<text id=\"" $1 "\">"] = "<text id=\"" $1 "\" title=\"" $2 "\">"
    next
}
$0 in map { $0 = map[$0] }
1
' file2 file1

Here's another approach with GNU awk:

gawk '
    NR == FNR { title[NR] = $0; next }
    match($0, /<text id="([[:digit:]]+)">/, m) {
        sub(/>/, " title=\"" title[m[1]] "\">")
    }
    {print}
' titles articles

awk keeps two counters: FNR is the record number within the current file being processed; NR is the record number over all records processed so far. The condition NR == FNR is therefore true only for the records in the first file.
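The two-counter behaviour is easy to see on throwaway files (names and contents made up):

```shell
# Two throwaway input files
printf 'a\nb\n' > first
printf 'c\nd\n' > second

# NR keeps counting across files; FNR restarts at 1 for each new file
awk '{ print FILENAME, "NR=" NR, "FNR=" FNR }' first second
```

While `first` is read the two counters agree (1,1 then 2,2); on `second`, FNR restarts at 1 while NR continues at 3, which is why `NR == FNR` selects exactly the records of the first file.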

You need GNU awk for the extension to the match() function: the 3rd parameter is an array in which to store the matched portions of the regex.

This might work for you (GNU sed):

sed -r 's|^([0-9]+)\s*(.*)|/(<text id="\1")(>)/s//\\1 title="\2"\\2/|;t;d' file2 |
sed -rf - file1

Runs a sed script against the file holding the titles to produce a sed script to run against the source file.

Beware of metacharacters in the titles!
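The intermediate script is easy to inspect on a toy titles file (GNU sed assumed, since `-r` and reading a script from `-f -` are GNU-isms; the sample data is made up):

```shell
# Toy inputs in the same shape as the question's files (data made up)
printf '1\ttitle1\n2\ttitle2\n' > file2
printf '<text id="1">\nbla\n</text>\n<text id="2">\nbla\n</text>\n' > file1

# Stage 1 alone shows the generated sed commands, one per title
sed -r 's|^([0-9]+)\s*(.*)|/(<text id="\1")(>)/s//\\1 title="\2"\\2/|;t;d' file2

# Stage 1 piped into stage 2 rewrites the data file in a single pass
sed -r 's|^([0-9]+)\s*(.*)|/(<text id="\1")(>)/s//\\1 title="\2"\\2/|;t;d' file2 |
sed -rf - file1 > file1.new
cat file1.new
```

Each generated command uses an address regex followed by `s//`, which reuses that last regex, so every title still costs one regex test per line of the big file; the win over the question's approach is that the file is read and written only once.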

Here is another Perl version. It first reads all the titles into a hash, then copies each line from your original file to a new file, substituting when necessary.

use strict;
use warnings;

open (my $title_file, '<', 'titles.txt') or die "Could not open titles.txt, $!";
my %titles;
while (<$title_file>) {
    chomp;
    my ($id,$title) = split(m/\s+/,$_,2);
    $titles{$id} = $title;
}
close $title_file;

open (my $in_file, '<', 'in.txt') or die "Could not open in.txt, $!";
open (my $out_file, '>', 'out.txt') or die "Could not open out.txt, $!";
while (<$in_file>) {
    if (m/<text id=/) {
        s/<text id="(\d+)">/<text id="$1" title="$titles{$1}">/;
    }
    print $out_file $_;
}
close $in_file;
close $out_file;

First, build a sed script from the file holding the paired ID - Title lines:

sed 's|\([0-9]\{1,\}\)[[:blank:]]*\([^[:blank:]].*\)|/<text id="\1"/ {s/>/ title="\2">/\
   b\
   }|' ID_Title_File > /tmp/ID_Chg.sed

Two accelerators (compared to your version):

  1. All the actions happen in the same sed run (no restart of sed per substitution, which is time-consuming, especially with this number of lines)
  2. After a successful find, replace the end of the tag and then skip the rest of the actions for the same line (so no more tests for this occurrence)

Then treat your huge file with this action list:

sed -unbuffer -f /tmp/ID_Chg.sed file1 > Output

For GNU sed you may need the --posix option (test made on KSH/AIX).

Just for test purposes:

Increment=$((80000 / 128)); echo "" > /tmp/ID_Chg.sed; Iter=0; while [ $Iter -lt 80000 ]; do echo "/id=\"$Iter\"/ b r$Iter" >> /tmp/ID_Chg.sed; let Iter+=Increment; done
sed 's|\([0-9]\{1,\}\)[[:blank:]]*\([^[:blank:]].*\)|:r\1\
/<text id="\1"/ {s/>/ title="\2">/\
   b\
   }|' ID_Title.lst >> /tmp/ID_Chg.sed

where 80000 is the number of IDs and 128 the number of sub-sections ("accelerators") wanted.
