
How do I identify and check text using Perl?

I have XML data like this:

<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//ES//DTD journal  article DTD version 5.2.0//EN//XML" "art520.dtd" [<!ENTITY mmc1 SYSTEM "mmc1" NDATA APPLICATION><!ENTITY mmc2 SYSTEM "mmc2" NDATA APPLICATION>]><article docsubtype="fla">    <item-info><jid>JURO</jid><aid>10407</aid><ce:pii>S0022-5347(13)04374-7</item-info><ce:inter-ref xlink:href="http://download.journals.elsevierhealth.com/mmcs/journals/0022-5347/PIIS0022534713059089.mmc2.pdf" id="intref0010"><ce:bold>Abstract</ce:bold></ce:inter-ref><ce:inter-ref xlink:href="http://download.journals.elsevierhealth.com/mmcs/journals/0022-5347/PIIS0022534713059089.mmc1.pdf" id="intref0010">   <ce:bold>Abstract</ce:bold></ce:inter-ref>

Each abstract has a link. The link is generated from the ENTITY declaration ("mmc\d") and the data in

<pii>..</pii>

Now I check whether a given link is correct by collecting two pieces of data from the file: the ENTITY declaration name (for example "mmc1") and the data in

<pii></pii>

My code collects this data. In this file the first values are mmc1 and S0022-5347(13)04374-7. I remove the '-', '(' and ')' characters from the pii variable and append mmc1 to it. That gives a string like "PIIS0022534713059089.mmc1", which I then check against the link.
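A minimal sketch of that normalisation step, for illustration only. The helper name expected_link_part is made up here, and the "PII" prefix is an assumption based on the form of the string described above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): build the string
# that is expected to appear inside the link for one pii / entity pair.
sub expected_link_part {
    my ($pii, $entity) = @_;
    (my $clean = $pii) =~ s/[-()]//g;   # drop every '-', '(' and ')'
    return "PII$clean.$entity";         # "PII" prefix is an assumption
}

# Sample values taken from the XML in the question.
print expected_link_part('S0022-5347(13)04374-7', 'mmc1'), "\n";
# prints PIIS0022534713043747.mmc1
```

Note that for the sample pii this gives a number ending in 43747, which is not the number used in the sample links, so a strict comparison would report the sample file itself as wrong.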

Now the problem is this:

The abstract link appears twice in a file, so the first abstract should contain "PIIS0022534713059089.mmc1" and the second abstract should contain "PIIS0022534713059089.mmc2". If they are given the wrong way round (the first abstract has "PIIS0022534713059089.mmc2" and the second has "PIIS0022534713059089.mmc1"), I want to detect that and tell the user.

My code is:

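#!/usr/bin/perl
use strict;
use warnings;

print "start..";

my @files = <*.xml>;

open my $out, '>', 'output.xml' or die $!;

foreach my $file (@files) {
    open my $fh, '<', $file or die "Cannot open $file: $!";

    # Declared outside the loop so values found on earlier lines stay visible.
    my ($pii, $count) = ('', '');

    while (my $line = <$fh>) {
        if ($line =~ /<ce:pii>(.*)<\/ce:pii>/) {
            $pii = $1;
            $pii =~ s/[-()]//g;    # strip '-', '(' and ')'
        }
        if ($line =~ /"(mmc[1-5])"/) {
            $count = $1;           # was "my $count", which went out of scope at the end of this block
        }
        if ($line =~ /\Q$pii.$count\E/) {
            print ".";
        }
        else {
            print $out "$file = wrong\n";
        }
    }
    close $fh;
}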

It sounds like you're asking how to make sure that each abstract appears in ascending order -- starting with .mmc1, then .mmc2, then .mmc3, and so on. I'm going to guess also that the file may contain lots of different abstracts, so ABSTRACT1.mmc3 might be followed by ABSTRACT2.mmc1 and that should not signal an error. Right?

You can use a hash to keep track of the most recent mmc number seen for a particular abstract. Then when you see a new mmc for that abstract, you can check it against the one you previously saw. Like this:

open (FILE, "$file");
my %last_mmc;
while (my $line = <FILE>) {
    ...
    if ($line =~ /"mmc([1-5])"/) {
        # Check that the mmc number is correct.
        my $mmc = $1;    # the pattern has a single capture group, so the digit is in $1
        $last_mmc{$pii} ||= 0;
        warn "incorrect mmc: $pii.$line" if ($mmc != $last_mmc{$pii} + 1);
        $last_mmc{$pii} = $mmc;
        $count = "mmc$mmc";
    }
    ...
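
For a version you can run on its own, here is a minimal sketch of the same idea. It makes some assumptions: the links look like the ones in the question (PII followed by word characters, then .mmcN, inside an xlink:href attribute), and it reads the numbers from the links themselves rather than from the ENTITY declarations. The name check_mmc_order and the inline sample string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return a list of problems found in one XML string.
# For each PII it expects the links to appear as mmc1, mmc2, mmc3, ...
sub check_mmc_order {
    my ($xml) = @_;
    my (%last_mmc, @problems);

    # Look only inside xlink:href="..." values and pull out "PII....mmcN".
    while ($xml =~ /xlink:href="[^"]*?(PII\w+)\.mmc(\d+)[^"]*"/g) {
        my ($pii, $mmc) = ($1, $2);
        my $expected = ($last_mmc{$pii} || 0) + 1;
        push @problems, "$pii: expected mmc$expected, found mmc$mmc"
            if $mmc != $expected;
        $last_mmc{$pii} = $mmc;
    }
    return @problems;
}

# Sample input modelled on the question, with the two links swapped.
my $base = 'http://download.journals.elsevierhealth.com/mmcs/journals/0022-5347';
my $xml  = qq{<ce:inter-ref xlink:href="$base/PIIS0022534713059089.mmc2.pdf"/>}
         . qq{<ce:inter-ref xlink:href="$base/PIIS0022534713059089.mmc1.pdf"/>};

print "$_\n" for check_mmc_order($xml);
```

Because the hash is keyed by PII, a different abstract starting again at mmc1 is not reported. To use it per file, you would read each file into a string and pass it in.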
