How do I identify and check text using Perl?

I have XML data like this:

<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//ES//DTD journal  article DTD version 5.2.0//EN//XML" "art520.dtd" [<!ENTITY mmc1 SYSTEM "mmc1" NDATA APPLICATION><!ENTITY mmc2 SYSTEM "mmc2" NDATA APPLICATION>]><article docsubtype="fla">    <item-info><jid>JURO</jid><aid>10407</aid><ce:pii>S0022-5347(13)04374-7</item-info><ce:inter-ref xlink:href="http://download.journals.elsevierhealth.com/mmcs/journals/0022-5347/PIIS0022534713059089.mmc2.pdf" id="intref0010"><ce:bold>Abstract</ce:bold></ce:inter-ref><ce:inter-ref xlink:href="http://download.journals.elsevierhealth.com/mmcs/journals/0022-5347/PIIS0022534713059089.mmc1.pdf" id="intref0010">   <ce:bold>Abstract</ce:bold></ce:inter-ref>

A link is given for each abstract. The link is generated from the ENTITY declaration (matching mmc\d) and the data in

<pii>..</pii>

Now I check whether a given link is correct by collecting two pieces of data from the file: the ENTITY declaration (for example mmc1) and the data in

<pii></pii>

My code collects this data. In this example the values are mmc1 and S0022-5347(13)04374-7. I remove the '-', '(' and ')' characters from the pii variable and append mmc1, which gives a string like "PIIS0022534713059089.mmc1", and then I check the link against it.
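For reference, the normalization described above can be sketched as a small helper. This is only an illustration: the name `expected_link_id` is hypothetical, and the "PII" prefix is an assumption inferred from the sample links.

```perl
use strict;
use warnings;

# Hypothetical helper: build the string expected inside a link from the
# raw <ce:pii> value and an entity name such as "mmc1".
# The "PII" prefix is assumed from the sample hrefs.
sub expected_link_id {
    my ($raw_pii, $entity) = @_;
    (my $pii = $raw_pii) =~ tr/-()//d;    # delete every '-', '(' and ')'
    return "PII$pii.$entity";
}

print expected_link_id('S0022-5347(13)04374-7', 'mmc1'), "\n";
# prints "PIIS0022534713043747.mmc1"
```

Using `tr///d` removes every occurrence of the three characters in one pass, whereas `s/\(//` without the `/g` flag removes only the first match.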

Now the problem is:

The abstract link appears twice in a file, so the first abstract should contain "PIIS0022534713059089.mmc1" and the second should contain "PIIS0022534713059089.mmc2". If they are given the wrong way round (the first abstract contains "PIIS0022534713059089.mmc2" and the second contains "PIIS0022534713059089.mmc1"), I need to detect that and tell the user.

My code is:

#!/usr/bin/perl  

print "start..";

@files = <*.xml>;

open my $out, '>', 'output.xml' or die $!;

foreach $file (@files) {

    open (FILE, "$file");

    while (my $line = <FILE>) {
        if ($line =~ /(<ce:pii>)(.*)(<\/ce:pii>)/) {
            $pii = $2;
            $pii =~ s/\-//g;
            $pii =~ s/\(//;
            $pii =~ s/\)//;
        }
        if ($line =~ /\"(mmc)([1-5]{1})\"/) {
            my $count = $1 . $2;
        }
        if ($line =~ /$pii\.$count/) {
            print ".";
        }
        else {
            print $out("$file = wrong\n");
        }
    }
}

It sounds like you're asking how to make sure that each abstract appears in ascending order: starting with .mmc1, then .mmc2, then .mmc3, and so on. I'm also going to guess that the file may contain lots of different abstracts, so ABSTRACT1.mmc3 might be followed by ABSTRACT2.mmc1, and that should not signal an error. Right?

You can use a hash to keep track of the most recent mmc number seen for a particular abstract. Then, when you see a new mmc for that abstract, you can check it against the one you previously saw. Like this:

open (FILE, "$file");
my %last_mmc;
while (my $line = <FILE>) {
    ...
    if ($line =~ /"mmc([1-5])"/) {
        # Check that the mmc number follows the previous one for this pii.
        my $n = $1;    # the pattern has a single capture group
        $last_mmc{$pii} ||= 0;
        warn "incorrect mmc: $pii.$line" if ($n != $last_mmc{$pii} + 1);
        $last_mmc{$pii} = $n;
        my $count = "mmc$n";
    }
    ...
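As a self-contained variant of the same idea, here is a minimal sketch that scans the whole document text instead of reading line by line (the sample XML sits on a single line, so a per-line loop would only see the first match). The function name `check_mmc_order` is hypothetical, and the URL shape `.../<PII>.mmc<N>.<ext>` is an assumption taken from the sample links.

```perl
use strict;
use warnings;

# Hypothetical helper: return one message for every link whose mmc number
# is not exactly one more than the previous link seen for the same PII.
sub check_mmc_order {
    my ($xml) = @_;
    my (%last_mmc, @problems);
    # Walk every xlink:href in document order. The URL shape
    # ".../<PII>.mmc<N>.<ext>" is assumed from the sample data.
    while ($xml =~ m{xlink:href="[^"]*/([^"/]+)\.mmc(\d+)\.[^"]*"}g) {
        my ($pii, $n) = ($1, $2);
        my $expected = ($last_mmc{$pii} // 0) + 1;
        push @problems, "$pii: expected mmc$expected, found mmc$n"
            if $n != $expected;
        $last_mmc{$pii} = $n;
    }
    return @problems;
}

# Made-up sample with the two links swapped.
my $swapped = '<ce:inter-ref xlink:href="http://example.com/a/PIIS1.mmc2.pdf"/>'
            . '<ce:inter-ref xlink:href="http://example.com/a/PIIS1.mmc1.pdf"/>';
print "$_\n" for check_mmc_order($swapped);
```

Because the counter is keyed by the PII part of the link, a new PII starts again at mmc1 without being reported, which matches the guess about files containing several abstracts.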
