
SAX Parser - Extract string within tags

This is my problem: I need to extract the text inside the <p> tags, without the XML markup, using a SAX parser.

    <title>1. Introduction</title>
    <p>The Lorem ipsum 
           <xref ref-type="bibr" rid="B1">
                1
           </xref>. 
           Lorem ipsum 23.
     </p>
     <p>The L domain recruits an ATP-requiring cellular factor for this 
           scission event, the only known energy-dependent step in assembly 
           <xref ref-type="bibr" rid="B2">
                2
           </xref>. 
           Domain is used here to denote the amino 
           acid sequence that constitutes the biological function.
     </p>

Is this possible using endElement()? When I use it, I get only the part after the closing </xref> tag.

Here is the code:

public void endElement(String s, String s1, String element) throws SAXException {

        if(element.equals(Finals.PARAGRAPH)){
            Paragraph paragraph = new Paragraph();
            paragraph.setContext(tmpValue);
            System.out.println("Contesto: " + tmpValue);
            listP.add(paragraph);

        }
    }
    @Override
    public void characters(char[] ac, int i, int j) throws SAXException {
        tmpValue = new String(ac, i, j);

    }

This is what I expect: a list listP containing the two paragraphs:

1) Lorem ipsum 1 Lorem ipsum 23.
2) The L domain recruits an ATP-requiring cellular factor for this 
       scission event, the only known energy-dependent step in assembly 2 
       Domain is used here to denote the amino 
       acid sequence that constitutes the biological function.

I'm not sure what you mean by "is it possible using endElement", but it's certainly possible. You'd need to write your SAX application so it:

(1) ignores all startElement / endElement events that arrive between the ones for the <p> element. This takes only simple state tracking, or you can decide that you aren't interested in elements other than paragraphs and make your element event handlers no-ops for anything you don't care about.

(2) accumulates the separately delivered characters() events until the endElement for the <p> element. You need to do this anyway, because SAX always reserves the right to deliver contiguous text as several characters() calls, for reasons having to do with parser buffer management.
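The two points above can be sketched roughly as follows. This is only a minimal illustration, not the asker's actual classes: the name ParagraphCollector, the hard-coded "p" element name, the parse() helper, and the whitespace normalization are all assumptions. It also assumes the fragment is wrapped in a single root element, since a SAX parser needs a well-formed document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of every <p>, including text nested in child elements.
class ParagraphCollector extends DefaultHandler {
    private final List<String> paragraphs = new ArrayList<>();
    private final StringBuilder buffer = new StringBuilder();
    private boolean inParagraph = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("p".equals(qName)) {
            inParagraph = true;
            buffer.setLength(0); // start a fresh paragraph
        }
        // Any other element (for example <xref>) is deliberately ignored.
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one run of text, so always append.
        if (inParagraph) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(qName)) {
            // Assumption: collapsing runs of whitespace is wanted for the output.
            paragraphs.add(buffer.toString().trim().replaceAll("\\s+", " "));
            inParagraph = false;
        }
    }

    List<String> getParagraphs() {
        return paragraphs;
    }

    // Convenience helper (an assumption): parses an XML string with the default SAX parser.
    static List<String> parse(String xml) throws Exception {
        ParagraphCollector handler = new ParagraphCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getParagraphs();
    }
}
```

Because the text inside <xref> arrives as ordinary characters() events while inParagraph is still true, it ends up in the same buffer as the surrounding paragraph text, and the <title> text is skipped because the flag is not set there.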

There are many possible solutions. With SAX parsers you usually add a few boolean flags to track particular states while parsing. In this simple example, it is enough to change this:

tmpValue = new String(ac, i, j);

to this:

if (tmpValue.equals(""))
    tmpValue = new String(ac, i, j);
else
    tmpValue += new String(ac, i, j);

or:

if (tmpValue == null)
    tmpValue = new String(ac, i, j);
else
    tmpValue += new String(ac, i, j);

Which version you use depends on how you initialize the tmpValue variable (and you should initialize it if you aren't doing so already).

To gather the contents of all paragraphs, reset tmpValue once each paragraph has been stored:

public void endElement(String s, String s1, String element) throws SAXException {

    if (element.equals(Finals.PARAGRAPH)) {
        Paragraph paragraph = new Paragraph();
        paragraph.setContext(tmpValue);
        System.out.println("Contesto: " + tmpValue);
        listP.add(paragraph);
        tmpValue = ""; // or tmpValue = null; for the second version
    }
}

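and to omit the title part, reset it again when a paragraph starts:

@Override
public void startElement(
    String uri,
    String localName,
    String qName,
    Attributes atts) throws SAXException {

    // Compare qName, as endElement does above: localName is empty
    // when the parser is not namespace-aware (the JAXP default).
    if (qName.equals(Finals.PARAGRAPH)) {
        tmpValue = ""; // or tmpValue = null; for the second version
    }
}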

Use a stack: push in startElement events and pop in endElement events.

Or, if that doesn't work for you, push every event onto the stack and then, after endDocument(), pop the elements one by one. Because they come off in reverse order, store the data from </p> back to <p> in reverse.
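One possible reading of the first stack suggestion is sketched below. It is an interpretation, not necessarily what the answer had in mind: each startElement pushes a fresh text buffer, and each endElement pops it and either stores it (for <p>) or appends it to the parent's buffer. The class name StackTextHandler, the hard-coded "p", the parse() helper, and the whitespace normalization are assumptions.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: keeps one text buffer per open element on a stack.
class StackTextHandler extends DefaultHandler {
    private final Deque<StringBuilder> stack = new ArrayDeque<>();
    private final List<String> paragraphs = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        stack.push(new StringBuilder()); // buffer for the element that just opened
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!stack.isEmpty()) {
            stack.peek().append(ch, start, length); // text belongs to the innermost open element
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        StringBuilder text = stack.pop();
        if ("p".equals(qName)) {
            // Assumption: collapsing runs of whitespace is wanted for the output.
            paragraphs.add(text.toString().trim().replaceAll("\\s+", " "));
        } else if (!stack.isEmpty()) {
            // Fold the child's text (for example <xref>) into its parent's buffer.
            stack.peek().append(text);
        }
    }

    List<String> getParagraphs() {
        return paragraphs;
    }

    // Convenience helper (an assumption): parses an XML string with the default SAX parser.
    static List<String> parse(String xml) throws Exception {
        StackTextHandler handler = new StackTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getParagraphs();
    }
}
```

Compared with a single boolean flag, the stack costs a little more bookkeeping but keeps working if more levels of nesting appear inside a paragraph.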
