
How to check if an XML file contains consecutive nodes?

I have some XML files that look like this:

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="jats-html.xsl"?>
<!--<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD with OASIS Tables v1.0 20120330//EN" "JATS-journalpublishing-oasis-article1.dtd">-->
<article article-type="proceedings" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:oasis="http://www.niso.org/standards/z39-96/ns/oasis-exchange/table">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id" />
<journal-title-group>
<journal-title>Eleventh &#x0026; Tenth International Conference on Correlation Optics</journal-title>
</journal-title-group>
<issn pub-type="epub">0277-786X</issn>
<publisher>
<publisher-name>Springer</publisher-name>
</publisher>
</journal-meta>
<fig-count count="0" />
<table-count count="0" />
<equation-count count="0" />
</front>
<body>
<sec id="s1">
<label>a.</label>
<title>INTRODUCTION</title>
<p>One of approaches of solving<xref ref-type="bibr" rid="ref11">[11]</xref>, <xref ref-type="bibr" rid="ref13">[13]</xref>, <xref ref-type="bibr" rid="ref8">[8]</xref> the problem <xref ref-type="bibr" rid="ref1">[1]</xref>, <xref ref-type="bibr" rid="ref5">[2]</xref>, <xref ref-type="bibr" rid="ref6">[6]</xref> <xref ref-type="bibr" rid="ref7">[6]</xref> of light propagation in scattering media is the method of Monte Carlo statistical simulation<sup><xref ref-type="bibr" rid="c1">1</xref>–<xref ref-type="bibr" rid="c5">5</xref></sup>. It is a set of techniques that allow us to find the necessary solutions by repetitive random sampling. Estimates of the unknown quantities are statistical means.</p>
<p>For the case of radiation transport in scattering <xref ref-type="bibr" rid="ref6">6</xref> <xref ref-type="bibr" rid="ref8">8</xref> <xref ref-type="bibr" rid="ref9">9</xref> <xref ref-type="bibr" rid="ref10">10</xref> medium Monte Carlo method consists in repeated calculation of the trajectory <xref ref-type="bibr" rid="ref7">6</xref> <xref ref-type="bibr" rid="ref7">7</xref> <xref ref-type="bibr" rid="ref8">8</xref> <xref ref-type="bibr" rid="ref9">[9]</xref> of a photon in a medium based on defined environment parameters. Application of Monte Carlo method is based on the use of macroscopic optical properties of the medium which are considered homogeneous within small volumes of tissue. Models that are based on this method can be divided into two types: models that take into account the polarization of the radiation, and models that ignore it.</p>
<p>Simulation that is based on the previous models usually discards the details of the radiation energy distribution within a single scattering particle. This disadvantage can be ruled out (in the case of scattering particles whose size exceeds the wavelength) by using another method - reverse ray tracing. This method is like the one mentioned before on is based on passing a large number of photons through a medium that is simulated. The difference is that now each scattering particle has a certain geometric topology and scattering is now calculated using the Fresnel equations. The disadvantage of this method is that it can give reliable results only if the particle size is much greater than the wavelength (at least an order of magnitude).</p>
</sec>
</body>
</article>

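These files contain link nodes of the form <xref ref-type="bibr" rid="ref...">...</xref>. How can I find out whether there are 3 or more consecutive link nodes (separated by a comma and a space, or by just a space) in the file, and output them to a txt file?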

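I can do a regex search like (?:<xref type="bibr" rid="ref\\d+">\\[\\d+\\]</xref>\\s*,\\s*){2,}<xref type="bibr" rid="ref\\d+">\\[\\d+\\]</xref>, which finds 3 or more link nodes separated by ", " or " ", but the nodes it finds do not necessarily have consecutive IDs. How do I do this?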

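So, to meet your requirement, here is my solution to the problem. I have not thoroughly tested it for possible duplicates, i.e. some results might just be a subset of a previous result. But those should be easy to fix.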

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Text.RegularExpressions;


public static void Main(string[] args)
{
    XmlDocument doc = new XmlDocument();
    doc.PreserveWhitespace = true;
    doc.Load("article.xml");

    //only selects <p>'s that already have 3 or more refs. No need to check paragraphs that don't even have enough refs
    XmlNodeList nodes = doc.DocumentElement.SelectNodes("//*[count(xref[@ref-type='bibr' and starts-with(@rid,'ref')])>2]");

    List<string> results = new List<string>();

    //Foreach <p>
    foreach (XmlNode x in nodes)
    {
        XmlNodeList xrefs = x.SelectNodes(".//xref[@ref-type='bibr' and starts-with(@rid,'ref')]");
        List<StartEnd> startEndOfEachTag = new List<StartEnd>(); // we mark the start and end of each ref.
        string temp = x.OuterXml; //the paragraph we're checking

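        //finds the start and end of each xref tag inside the paragraph
        int searchFrom = 0; //search forward so a repeated xref is not found at its first occurrence again
        foreach (XmlNode xN in xrefs){
            int start = temp.IndexOf(xN.OuterXml, searchFrom, StringComparison.Ordinal);
            StartEnd se = new StartEnd(start, start + xN.OuterXml.Length);
            startEndOfEachTag.Add(se);
            if (start >= 0) searchFrom = se.end;
        }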

        /* This comment shows the regex command used and how we build the regular expression we are checking with.
        string regexTester = Regex.Escape("<xref ref-type=\"bibr\" rid=\"ref2\">2</xref>")+"([ ]|(, ))" + Regex.Escape("<xref ref-type=\"bibr\" rid=\"ref3\">3</xref>");
        Match matchTemp = Regex.Match("<xref ref-type=\"bibr\" rid=\"ref2\">2</xref> <xref ref-type=\"bibr\" rid=\"ref3\">3</xref>", regexTester);
        Console.WriteLine(matchTemp.Value);*/

        //we go through all the xrefs
        for (int i=0; i<xrefs.Count; i++)
        {
            int newIterator = i; //This iterator prevents us from creating duplicates.
            string regCompare = Regex.Escape(xrefs[i].OuterXml); // The start xref

            int count = 1; //we got one xref to start with we need at least 3
            string tempRes = ""; //the string we store the result in

            int consecutive = Int32.Parse(xrefs[i].Attributes["rid"].Value.Substring(3));

            for (int j=i+1; j<xrefs.Count; j++) //we check with the other xrefs to see if they follow immediately after.
            {
                if(consecutive == Int32.Parse(xrefs[j].Attributes["rid"].Value.Substring(3)) - 1)
                {
                    consecutive++;
                }
                else { break; }

                regCompare += "([ ]|(, ))" + Regex.Escape(xrefs[j].OuterXml); //we check that the xref comes exactly after a space or a comma and space

                Match matchReg;

                try
                {
                    matchReg = Regex.Match(temp.Substring(startEndOfEachTag[i].start, startEndOfEachTag[j].end - startEndOfEachTag[i].start),
                        regCompare); //we get the result
                }
                catch
                {
                    i = j; // we failed and i should start from here now.
                    break;
                }

                if (matchReg.Success){
                    count++; //it was a success so we increment the number of xrefs we matched
                    tempRes = matchReg.Value; // we add it to out temporary result.
                    newIterator = j; //update where i should start from next time.
                }
                else {
                    i = j; // we failed and i should start from here now.
                    break;
                }
            }
            i = newIterator;
            if (count > 2)
            {
                results.Add(tempRes); 
            }
        }
    }
    Console.WriteLine("Results: ");
    foreach(string s in results)
    {
            Console.WriteLine(s+"\n");
    }

    Console.ReadKey();
}

The missing class:

class StartEnd
{
    public int start=-1;
    public int end = -1;

    public StartEnd(int start, int end)
    {
        this.start = start;
        this.end = end;
    }
}
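
One fragile spot in the code above is `Int32.Parse(...Substring(3))`, which throws if an rid is not exactly `ref` followed by digits (for example `refA`). A minimal sketch of a more forgiving helper follows. `RidParser` and `TryGetRidNumber` are hypothetical names introduced here for illustration, not part of the answer, and the `ref` prefix is assumed from the sample data.

```csharp
using System;
using System.Globalization;

public static class RidParser
{
    // Hypothetical helper: extracts 12 from "ref12".
    // Returns false (and number = 0) for null, for a different prefix,
    // or when anything other than plain digits follows "ref".
    public static bool TryGetRidNumber(string rid, out int number)
    {
        number = 0;
        const string prefix = "ref"; // assumed prefix, taken from the sample XML
        if (rid == null || !rid.StartsWith(prefix, StringComparison.Ordinal))
            return false;

        // NumberStyles.None accepts digits only: no sign, no whitespace.
        return int.TryParse(rid.Substring(prefix.Length), NumberStyles.None,
                            CultureInfo.InvariantCulture, out number);
    }
}
```

A caller could then skip an xref whose rid does not parse instead of letting the whole run abort with a FormatException.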

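My XPath is a bit rusty, but I believe you can write a better XPath than the one I use below. A better XPath would select only nodes that have 3 or more child nodes of type bibr whose rid starts with ref. Anyway, here is my solution for getting the nodes you want.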

public static void Main(string[] args)
{
    XmlDocument doc = new XmlDocument();
    doc.Load("article.xml");

    XmlNodeList nodes = doc.DocumentElement.SelectNodes("//xref[@ref-type='bibr' and starts-with(@rid,'ref')]/parent::*");

    foreach(XmlNode x in nodes)
    {
        //relative path: count only the xrefs that are children of this node, not every xref in the document
        XmlNodeList temp = x.SelectNodes("xref[@ref-type='bibr' and starts-with(@rid,'ref')]");
        //we only select those that have 3 or more references.
        if (temp.Count >= 3)
        {
            Console.WriteLine(x.InnerText);
        }
    }

    Console.ReadKey();

}

Edit: I played around a bit, and the code below has an updated XPath that should get everything you want.

public static void Main(string[] args)
{
    XmlDocument doc = new XmlDocument();
    doc.Load("article.xml");

    XmlNodeList nodes = doc.DocumentElement.SelectNodes("//*[count(xref[@ref-type='bibr' and starts-with(@rid,'ref')])>2]");

    foreach(XmlNode x in nodes){
        Console.WriteLine(x.InnerText);
    }

    Console.ReadKey();

}

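Regular expressions are not a good fit for hierarchical syntax. I would write C# code that reads the XML and keeps track of the number of consecutive xref nodes separated only by ", " or " ".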

  static void Main(string[] args)
  {
     using (var xmlStream = System.Reflection.Assembly.GetExecutingAssembly().GetManifestResourceStream("ConsoleApp1.XMLFile1.xml"))
     {
        int state = 0; // 0 = Look for xref; 1 = look for separator
        string[] simpleSeparators = { " ", ", " };
        string rid = "0";
        StringBuilder nodeText = new StringBuilder();
        string[] consecutiveNodes = new string[3];

        System.Xml.XmlReaderSettings settings = new System.Xml.XmlReaderSettings();
        settings.DtdProcessing = System.Xml.DtdProcessing.Ignore;
        using (var reader = System.Xml.XmlReader.Create(xmlStream, settings))
        {
           while (reader.Read())
           {
              if (reader.IsStartElement("xref"))
              {
                 nodeText.Append("<xref");
                 if (reader.HasAttributes)
                 {
                    while (reader.MoveToNextAttribute())
                       nodeText.AppendFormat(" {0}=\"{1}\"", reader.Name, reader.Value);
                 }
                 nodeText.Append(">");
                 string nextRid = reader.GetAttribute("rid");
                 switch (state)
                 {
                    case 0:
                       break;
                    case 2:
                    case 4:
                       if (Math.Abs(GetIndex(nextRid) - GetIndex(rid)) > 1)
                          state = 0;
                       break;
                 }
                 state++;
                 rid = nextRid;
              }
              else if (reader.NodeType == System.Xml.XmlNodeType.Text)
              {
                 if (state > 0)
                    nodeText.Append(reader.Value);
                 if ((state % 2 == 1) && simpleSeparators.Contains(reader.Value))
                       state++;
              }
              else if ((reader.NodeType == System.Xml.XmlNodeType.EndElement) && (state > 0))
              {
                 nodeText.AppendFormat("</{0}>", reader.Name);
                 consecutiveNodes[state / 2] = nodeText.ToString();
                 nodeText.Clear();
                 if (state > 3)
                 {
                    Console.WriteLine("{0}{1}{2}", consecutiveNodes[0], consecutiveNodes[1], consecutiveNodes[2]);
                    state = 0;
                 }
              }
              else if (reader.IsStartElement())
              {
                 nodeText.Clear();
                 state = 0;
              }
           }
        }
     }
  }

  static int GetIndex(string rid)
  {
     int start = rid.Length;
     while ((start > 0) && Char.IsDigit(rid, --start)) ;

     start++;
     if (start < rid.Length)
        return int.Parse(rid.Substring(start));
     return 0;
  }

Running this on your sample data outputs:

<xref ref-type="bibr" rid="ref2">[2]</xref>, <xref ref-type="bibr" rid="ref3">[3]</xref>, <xref ref-type="bibr" rid="ref4">[4]</xref>
<xref ref-type="bibr" rid="rid6">6</xref><xref ref-type="bibr" rid="rid6">9</xref><xref ref-type="bibr" rid="rid6">10</xref>

I updated the code to exclude:

<xref ref-type="bibr" rid="ref11">[11]</xref>, <xref ref-type="bibr" rid="ref13">[13]</xref>, <xref ref-type="bibr" rid="ref8">[8]</xref>

because ref11, ref13 and ref8 are not consecutive IDs, as your question requires.
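
For comparison, here is a minimal sketch of the same idea using LINQ to XML, which also writes the matches to a text file as the question asks. It is not code from either answer. The names `ConsecutiveXrefFinder`, `FindRuns`, `RidNumber` and the output file `consecutive-xrefs.txt` are made up for illustration, and it assumes that "consecutive" means each rid number is exactly one higher than the previous one and that the only allowed separators are " " and ", ".

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class ConsecutiveXrefFinder
{
    // Numeric part of an rid such as "ref12", or -1 if the rid has another shape.
    static int RidNumber(XElement xref)
    {
        Match m = Regex.Match((string)xref.Attribute("rid") ?? "", "^ref([0-9]+)$");
        return m.Success ? int.Parse(m.Groups[1].Value) : -1;
    }

    static bool IsBibrXref(XNode node)
    {
        var e = node as XElement;
        return e != null && e.Name.LocalName == "xref"
            && (string)e.Attribute("ref-type") == "bibr" && RidNumber(e) >= 0;
    }

    // Stores the current run if it is long enough, then resets it.
    static void Flush(List<string> run, ref int count, int minLength, List<string> results)
    {
        if (count >= minLength) results.Add(string.Concat(run));
        run.Clear();
        count = 0;
    }

    public static List<string> FindRuns(XDocument doc, int minLength = 3)
    {
        var results = new List<string>();
        foreach (XElement parent in doc.Descendants().Where(e => e.Elements().Any(c => IsBibrXref(c))))
        {
            var run = new List<string>(); // serialized xrefs and separators of the current run
            int count = 0, lastId = -1;
            string separator = null;      // separator seen since the last xref, if any

            foreach (XNode node in parent.Nodes())
            {
                var text = node as XText;
                if (IsBibrXref(node))
                {
                    int id = RidNumber((XElement)node);
                    if (count > 0 && separator != null && id == lastId + 1)
                        run.Add(separator);                        // this xref extends the run
                    else
                        Flush(run, ref count, minLength, results); // this xref starts a new run
                    run.Add(node.ToString(SaveOptions.DisableFormatting));
                    count++;
                    lastId = id;
                    separator = null;
                }
                else if (text != null && count > 0 && separator == null
                         && (text.Value == " " || text.Value == ", "))
                {
                    separator = text.Value; // remember it until the next xref decides
                }
                else
                {
                    Flush(run, ref count, minLength, results); // anything else ends the run
                    separator = null;
                }
            }
            Flush(run, ref count, minLength, results);
        }
        return results;
    }

    public static void Main(string[] args)
    {
        // PreserveWhitespace keeps the single-space text nodes between xrefs.
        XDocument doc = XDocument.Load("article.xml", LoadOptions.PreserveWhitespace);
        List<string> runs = FindRuns(doc);
        File.WriteAllLines("consecutive-xrefs.txt", runs); // hypothetical output file name
        Console.WriteLine("{0} run(s) written.", runs.Count);
    }
}
```

Loading with `LoadOptions.PreserveWhitespace` matters here: without it, LINQ to XML drops whitespace-only text nodes, so xrefs separated by a single space would look adjacent and never count as separated.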
