I have XML data like this:
<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//ES//DTD journal article DTD version 5.2.0//EN//XML" "art520.dtd" [<!ENTITY mmc1 SYSTEM "mmc1" NDATA APPLICATION><!ENTITY mmc2 SYSTEM "mmc2" NDATA APPLICATION>]><article docsubtype="fla"> <item-info><jid>JURO</jid><aid>10407</aid><ce:pii>S0022-5347(13)04374-7</item-info><ce:inter-ref xlink:href="http://download.journals.elsevierhealth.com/mmcs/journals/0022-5347/PIIS0022534713059089.mmc2.pdf" id="intref0010"><ce:bold>Abstract</ce:bold></ce:inter-ref><ce:inter-ref xlink:href="http://download.journals.elsevierhealth.com/mmcs/journals/0022-5347/PIIS0022534713059089.mmc1.pdf" id="intref0010"> <ce:bold>Abstract</ce:bold></ce:inter-ref>
Each abstract has a link. The link is generated from the ENTITY declaration (a name matching "mmc\d") and the data in <ce:pii>..</ce:pii>.
Now I want to check whether each link is correct by collecting two pieces of data: the ENTITY declaration (for example "mmc1") and the data in <ce:pii></ce:pii>. My code collects both. In this example they are mmc1 and S0022-5347(13)04374-7. I remove the '-', '(' and ')' characters from the pii value and append mmc1 to it, which gives something like "PIIS0022534713059089.mmc1", and then I check the link against that.
Now the problem is this: the abstract link appears twice in a file, so the first abstract should contain "PIIS0022534713059089.mmc1" and the second abstract should contain "PIIS0022534713059089.mmc2". If they are given the wrong way round (first abstract "PIIS0022534713059089.mmc2", second abstract "PIIS0022534713059089.mmc1"), I need to detect that and report it to the user.
My code is:
#!/usr/bin/perl
print "start..";
@files = <*.xml>;
open my $out, '>', 'output.xml' or die $!;
foreach $file (@files) {
    open (FILE, "$file");
    while (my $line = <FILE>) {
        if ($line =~ /(<ce:pii>)(.*)(<\/ce:pii>)/) {
            $pii = $2;
            $pii =~ s/\-//g;
            $pii =~ s/\(//;
            $pii =~ s/\)//;
        }
        if ($line =~ /\"(mmc)([1-5]{1})\"/) {
            my $count = $1 . $2;
        }
        if ($line =~ /$pii\.$count/) {
            print ".";
        }
        else {
            print $out("$file = wrong\n");
        }
    }
}
It sounds like you're asking how to make sure that each abstract appears in ascending order, starting with .mmc1, then .mmc2, then .mmc3, and so on. I'm going to guess also that the file may contain lots of different abstracts, so ABSTRACT1.mmc3 might be followed by ABSTRACT2.mmc1 and that should not signal an error. Right?
You can use a hash to keep track of the most recent mmc number seen for a particular abstract. Then when you see a new mmc for that abstract, you can check it against the one you previously saw. Like this:
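open(FILE, '<', $file) or die "Cannot open $file: $!";
my %last_mmc;    # last mmc number seen for each PII
while (my $line = <FILE>) {
    ...
    if ($line =~ /"mmc([1-5])"/) {
        my $num = $1;    # the pattern has only one capture group
        # Check that the mmc number is the next one in sequence.
        $last_mmc{$pii} ||= 0;
        warn "incorrect mmc: $pii.mmc$num in $line" if $num != $last_mmc{$pii} + 1;
        $last_mmc{$pii} = $num;
        $count = "mmc$num";    # no "my" here, so the value is visible after this block
    }
    ...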
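For reference, here is a minimal self-contained sketch of the same hash-based idea. It makes two assumptions that are not in the question: it reads the number straight from the xlink:href links rather than from the ENTITY declarations, and it assumes every link ends in <prefix>.mmc<N>.pdf. The helper name check_mmc_order is made up for illustration, and the defined-or operator (//) needs Perl 5.10 or later.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes the XML as one string and returns a list of
# messages, one for every link whose mmc number is out of sequence.
sub check_mmc_order {
    my ($xml) = @_;
    my %last_mmc;    # key: link prefix (e.g. PIIS...), value: last mmc number seen
    my @problems;

    # Assumed link shape: xlink:href=".../<prefix>.mmc<N>.pdf"
    while ($xml =~ /xlink:href="[^"]*?([^"\/]+)\.mmc(\d+)\.pdf"/g) {
        my ($prefix, $num) = ($1, $2);
        my $expected = ($last_mmc{$prefix} // 0) + 1;
        push @problems, "$prefix: expected mmc$expected, found mmc$num"
            if $num != $expected;
        $last_mmc{$prefix} = $num;
    }
    return @problems;
}

# Small demonstration with the two links in swapped order.
my $sample = '<ce:inter-ref xlink:href="http://example.invalid/PIIS1.mmc2.pdf"/>'
           . '<ce:inter-ref xlink:href="http://example.invalid/PIIS1.mmc1.pdf"/>';
print "$_\n" for check_mmc_order($sample);
```

Scanning the whole file as one string with a /g match also copes with input like the sample above, where several links sit on the same line.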