
huge text file (6Gb) search and replace

I have a huge file (6 GB) with 74,000 articles in this format:

<text id="1">
bla bla bla bla.........
</text>
<text id="2">
bla bla bla bla.........
</text>
<text id="3">
bla bla bla bla.........
</text>
<text id="............ and so on until 74,000

then I have another file having the title corresponding to each of the id's, like this:

1       title1
2       title2
3       title3
...
74000   title74000

I have to put the corresponding title into each of the <text> tags in the first file, so I transformed the second file into this script:

sed -i 's/<text id="1">/<text id="1" title="title1">/' file1
sed -i 's/<text id="2">/<text id="2" title="title2">/' file1
sed -i 's/<text id="3">/<text id="3" title="title3">/' file1
...
sed -i 's/<text id="74000">/<text id="74000" title="title74000">/' file1

Notice I didn't put the g at the end of the sed command because it is not a global search: at the first match it changes the string and moves on to the next search. The script works, but due to the huge size of the file it takes 12 minutes per change, which gives me about two years to complete all the changes, while I need them ASAP. So my question is: does anybody know how I can perform these changes in a faster way, maybe with some other utility, Python, Perl or anything else?
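The underlying problem is that each sed -i run rewrites the whole 6 GB file, so 74,000 runs means reading and writing the file 74,000 times. Any approach that loads the id-to-title table into memory and streams the big file once should finish in roughly the time of a single copy. A minimal sketch of that idea in Python (the helper names are mine; the whitespace-separated title format is taken from the question):

```python
import re

TAG = re.compile(r'<text id="(\d+)">')

def load_titles(lines):
    """Build an id -> title map from lines like '1\ttitle1' (tab or spaces)."""
    titles = {}
    for line in lines:
        num, title = line.rstrip("\n").split(None, 1)
        titles[num] = title
    return titles

def add_titles(xml_lines, titles):
    """Yield xml_lines with title="..." inserted into each <text id="N"> tag."""
    for line in xml_lines:
        m = TAG.match(line)
        if m and m.group(1) in titles:
            line = '<text id="{0}" title="{1}">{2}'.format(
                m.group(1), titles[m.group(1)], line[m.end():])
        yield line
```

Usage would be along the lines of `dst.writelines(add_titles(src, load_titles(titles_file)))`, writing to a new output file rather than editing 6 GB in place.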

In GNU Awk version 4, you could try:

gawk4 -f a.awk file2 RS="^$" file1

where a.awk is:

NR==FNR {
   # First file (file2): map each full tag, e.g. <text id="1">, to its title
   b["<text id=\""$1"\">"]=$2
   next
}

{
    # Second file (file1) is slurped whole because of RS="^$".
    # Split it on the <text id=...> tags; the tags themselves land in s[].
    n=split($0,a,/<text id=[^>]*>/,s)
    printf "%s%s",s[0],a[1]
    for (i=1; i<n; i++) {
        # Re-emit each tag with the title attribute inserted before the ">"
        ind=index(s[i],">")
        printf "%s%s", substr(s[i],1,ind-1) " title=\""b[s[i]]"\">", a[i+1]
    }
    printf "%s",s[n]
}

Output:

<text id="1" title="title1">
  bla bla bla bla.........
</text>
<text id="2" title="title2">
  bla bla bla bla.........
</text>
<text id="3" title="title3">
  bla bla bla bla.........
</text>

Update

Just for fun, I tested some of the solutions here on a 3.9 MB XML file (80,000 titles) and a 1.3 MB info file (also 80,000 titles):

  • @HåkonHægland : 0.629s
  • @tangent : 0.645s
  • @Borodin : 0.718s
  • @glennjackman : 1.098s

(Scripts for generating the input files can be found here: http://pastebin.com/PpTPt0gk )

Update 2

To get more reliable timing results I took an average over 20 runs:

  • @EdMorton : 0.485s (GNU Awk version 4.1)
  • @EdMorton : 0.528s (GNU Awk version 3.1.8)
  • @HåkonHægland : 0.589s
  • @Borodin : 0.599s
  • @tangent : 0.626s
  • @glennjackman : 1.074s

I suggest you use something like this.

It reads a line from the titles file every time it comes across a <text> tag in the XML file, and inserts the title attribute into the tag.

It also checks that the IDs in the two files match, and prints a log output every 500 <text> elements so that you can see its progress.

Output is sent to a separate file. You shouldn't overwrite the input file: if something goes wrong, you would lose your original data.

This should be only fractionally slower than just copying the XML file.

use strict;
use warnings;

use IO::Handle;

STDOUT->autoflush;

open my $in_xml,    '<', 'input.xml'  or die "Failed to open XML file: $!";
open my $in_titles, '<', 'titles.txt' or die "Failed to open titles file: $!";
open my $out_xml,   '>', 'output.xml' or die "Failed to open output file: $!";

while (my $xml_line = <$in_xml>) {

  if ( $xml_line =~ /<text/ ) {

    my ($id1) = $xml_line =~ /id="(\d+)"/;
    unless (defined $id1) {
      chomp $xml_line;
      die sprintf qq{Error in input XML file at line %d: %s\n-}, $in_xml->input_line_number, $xml_line;
    }
    printf "Processing ID %d\n", $id1 unless $id1 % 500;

    my $title_line = <$in_titles>;
    my ($id2, $title) = $title_line =~ /^(\d+)\s+(.+)/;
    unless (defined $id2) {
      chomp $title_line;
      die sprintf qq{Error in input titles file at line %d: %s\n-}, $in_titles->input_line_number, $title_line;
    }

    unless ($id1 == $id2) {
      die sprintf "ID mismatch %d <=> %d\nXML file line %d\ntitles file line %d\n-",
          $id1, $id2, $in_xml->input_line_number, $in_titles->input_line_number
    }

    $xml_line =~ s/>/ title="$title">/;
  }

  print $out_xml $xml_line;
}

close $out_xml or die "Failed to close output file: $!";

Output:

<text id="1" title="title1">
bla bla bla bla.........
</text>
<text id="2" title="title2">
bla bla bla bla.........
</text>
<text id="3" title="title3">
bla bla bla bla.........
</text>
With any awk, you can build a map from each full <text ...> line to its replacement line, then do an exact-line lookup:

awk '
NR==FNR {
    id = $1
    sub(/^[^[:space:]]+[[:space:]]+/,"")
    map["<text id=\"" id "\">"] = "<text id=\"" id "\" title=\"" $0 "\">"
    next
}
$0 in map { $0 = map[$0] }
1
' file2 file1

If file2 is tab-separated it gets simpler and, I expect, faster:

awk -F'\t' '
NR==FNR {
    map["<text id=\"" $1 "\">"] = "<text id=\"" $1 "\" title=\"" $2 "\">"
    next
}
$0 in map { $0 = map[$0] }
1
' file2 file1

Here's another approach with GNU awk

gawk '
    NR == FNR { title[NR] = $0; next }
    match($0, /<text id="([[:digit:]]+)">/, m) {
        sub(/>/, " title=\"" title[m[1]] "\">")
    }
    {print}
' titles articles

awk keeps 2 counters: FNR is the record number within the current file being processed; NR is the record number of all records processed so far. The condition NR == FNR is true for all records in the first file.

You need GNU awk for the extension to the match() function: the 3rd parameter is an array that stores the matched portions of the regex.

This might work for you (GNU sed):

sed -r 's|^([0-9]+)\s*(.*)|/(<text id="\1")(>)/s//\\1 title="\2"\\2/|;t;d' file2 |
sed -rf - file1

Runs a sed script against the file holding the titles to produce a sed script to run against the source file.

Beware of metacharacters in the titles!
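The warning matters because a generated s||| command breaks if a title contains the | delimiter, or the replacement-side metacharacters & and \. If titles might contain such characters, escape them before generating the script. A small sketch in Python (the `sed_safe` helper name is mine; it covers only the replacement side of an s/.../.../ command):

```python
def sed_safe(text):
    """Escape characters that are special on the replacement side of a
    sed s/.../.../ command: backslash, ampersand, and the / delimiter."""
    for ch in ("\\", "&", "/"):
        text = text.replace(ch, "\\" + ch)
    return text
```

The pattern side would additionally need regex metacharacters (., *, [, etc.) escaped.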

Here is another Perl version. It first reads all the titles into a hash, then copies each line from your original file to a new file, substituting when necessary.

use strict;
use warnings;

open (my $title_file, '<', 'titles.txt') or die "Could not open titles.txt, $!";
my %titles;
while (<$title_file>) {
    chomp;
    my ($id,$title) = split(m/\s+/,$_,2);
    $titles{$id} = $title;
}
close $title_file;

open (my $in_file, '<', 'in.txt') or die "Could not open in.txt, $!";
open (my $out_file, '>', 'out.txt') or die "Could not open out.txt, $!";
while (<$in_file>) {
    if (m/<text id=/) {
        s/<text id="(\d+)">/<text id="$1" title="$titles{$1}">/;
    }
    print $out_file $_;
}
close $in_file;
close $out_file;

First, build a sed script for your substitutions from the file holding the ID/title pairs:

sed 's|\([0-9]\{1,\}\)[[:blank:]]*\([^[:blank:]].*\)|/<text id="\1"/ {s/>/ title="\2">/\
   b\
   }|' ID_Title_File > /tmp/ID_Chg.sed

Two accelerators (compared to your version):

  1. All the actions happen in the same sed invocation (no restart of sed per substitution, which is time-consuming, especially on this number of lines).
  2. After a successful match, the substitution is made and the rest of the actions for that line are skipped via the b branch (so no more tests for that occurrence).

Then run your huge file through this action list:

sed -unbuffer -f /tmp/ID_Chg.sed file1 > Output

For GNU sed you may need the --posix option (tests were made with ksh on AIX).

Just for testing purposes:

Increment=$((80000 / 128)); echo "" > /tmp/ID_Chg.sed; Iter=0; while [ $Iter -lt 80000 ]; do echo "/id=\"$Iter\"/ b r$Iter" >> /tmp/ID_Chg.sed; let Iter+=Increment; done
sed 's|\([0-9]\{1,\}\)[[:blank:]]*\([^[:blank:]].*\)|:r\1\
/<text id="\1"/ {s/>/ title="\2">/\
   b\
   }|' ID_Title.lst >> /tmp/ID_Chg.sed

where 80000 is the number of IDs and 128 the number of subsections ("accelerators") wanted.
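The "accelerator" amounts to a dispatch table: 128 lines of the form /id="N"/ b rN at the top of the script jump to the :rN label nearest the matching block, so each input line is tested against roughly 625 patterns instead of 80,000. The bucketing arithmetic can be sketched like this (Python; the function name is mine, the numbers are from the answer):

```python
def dispatch_label(idnum, total=80000, buckets=128):
    """Return the N of the :rN label whose block this id falls into,
    assuming labels are emitted every total//buckets ids starting at 0."""
    step = total // buckets      # 625 ids per bucket
    return (idnum // step) * step
```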
