Trouble retrieving text from XML with ElementTree with tags

Right now I have some code which uses Biopython and NCBI's "Entrez" API to get XML strings from PubMed Central. I'm trying to parse the XML with ElementTree to extract just the text from the page. Although I have BeautifulSoup code that does exactly this when I scrape the lxml data from the site itself, I'm switching to the NCBI API since scrapers are apparently a no-no. But now with the XML from the NCBI API, I'm finding ElementTree extremely unintuitive and could really use some help getting it to work. Of course I've looked at other posts, but most of these deal with namespaces, and in my case I just want to use the XML tags to grab information. Even the ElementTree documentation doesn't go into this (from what I can tell). Can anyone help me figure out the syntax to grab information within certain tags rather than within certain namespaces?

Here's an example. Note: I use Python 3.4

Small snippet of the XML:

      <sec sec-type="materials|methods" id="s5">
      <title>Materials and Methods</title>
      <sec id="s5a">
        <title>Overgo design</title>
        <p>In order to screen the saltwater crocodile genomic BAC library described below, four overgo pairs (forward and reverse) were designed (<xref ref-type="table" rid="pone-0114631-t002">Table 2</xref>) using saltwater crocodile sequences of MHC class I and II from previous studies <xref rid="pone.0114631-Jaratlerdsiri1" ref-type="bibr">[40]</xref>, <xref rid="pone.0114631-Jaratlerdsiri3" ref-type="bibr">[42]</xref>. The overgos were designed using OligoSpawn software, with a GC content of 50&#x2013;60% and 36 bp in length (8-bp overlapping) <xref rid="pone.0114631-Zheng1" ref-type="bibr">[77]</xref>. The specificity of the overgos was checked against vertebrate sequences using the basic local alignment search tool (BLAST; <ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/">http://www.ncbi.nlm.nih.gov/</ext-link>).</p>
    <table-wrap id="pone-0114631-t002" orientation="portrait" position="float">
      <object-id pub-id-type="doi">10.1371/journal.pone.0114631.t002</object-id>
      <label>Table 2</label>
      <caption>
        <title>Four pairs of forward and reverse overgos used for BAC library screening of MHC-associated BACs.</title>
      </caption>
      <alternatives>
        <graphic id="pone-0114631-t002-2" xlink:href="pone.0114631.t002"/>
        <table frame="hsides" rules="groups">
          <colgroup span="1">
            <col align="left" span="1"/>
            <col align="center" span="1"/>
          </colgroup>

For my project, I want all of the text in the "p" tags (not just for this snippet of the XML, but for the entire XML string).

Now, I already know that I can make the whole XML string into an ElementTree Object

>>> import xml.etree.ElementTree as ET
>>> root = ET.fromstring(xml_string)
>>> tree = ET.ElementTree(root)

Now if I try to get the text using the tag like this:

 >>> text = root.find('p')
 >>> print("".join(text.itertext()))

or

 >>> text = root.get('p').text

I can't extract the text that I want. From what I've read, this is because I'm using the tag "p" as an argument rather than a namespace.

While I feel like it should be quite simple for me to get all the text in "p" tags within an XML file, I'm currently unable to do it. Please let me know what I'm missing and how I can fix this. Thanks!

--- EDIT ---

So now I know that I should be using this code to get everything in the 'p' tags:

>>> text = root.find('.//p')
>>> print("".join(text.itertext()))

Despite the fact that I'm using itertext(), it's only returning content from the first "p" tag and not looking at any other content. Does itertext() only iterate within a tag? The documentation seems to suggest it iterates across all tags as well, so I'm not sure why it's only returning one line instead of all of the text under all of the "p" tags.

--- FINAL EDIT ---

I figured out that itertext() only works within one tag and find() only returns the first match. In order to get the entire text that I want, I must use findall():

>>> all_text = root.findall('.//p')
>>> for texts in all_text:
...     print("".join(texts.itertext()))
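
As a self-contained illustration of the same idea, here is a minimal sketch that collects the text of every "p" element into a list. The XML string is a hypothetical, trimmed-down stand-in for the PMC record (not the real data), and root.iter('p') is used as an alternative to root.findall('.//p'); both visit "p" elements at any depth below the root.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for the PMC XML (not the real record).
xml_string = """<article>
  <sec id="s5">
    <title>Materials and Methods</title>
    <sec id="s5a">
      <title>Overgo design</title>
      <p>Four overgo pairs were designed (<xref rid="t002">Table 2</xref>) for screening.</p>
      <p>Specificity was checked using BLAST.</p>
    </sec>
  </sec>
</article>"""

root = ET.fromstring(xml_string)

# itertext() walks a single element and its descendants, so inline children
# such as <xref> contribute both their text and their tail to the paragraph.
paragraphs = ["".join(p.itertext()) for p in root.iter("p")]

for paragraph in paragraphs:
    print(paragraph)
```

Keeping the paragraphs in a list rather than printing them directly makes it easy to join them later, for example with "\n".join(paragraphs).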

root.get() is the wrong method here: it retrieves an attribute of the root element, not a child element. root.find() is the right one, as it returns the first matching child element (alternatively, use root.findall() to get all matching child elements).
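
The difference can be seen in a small sketch. The XML below is a hypothetical minimal document, not the PMC record, and the variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal document: the <p> is a grandchild of the root.
xml_string = '<sec id="s5"><sec id="s5a"><p>Nested paragraph.</p></sec></sec>'
root = ET.fromstring(xml_string)

# get() looks up an attribute on the element itself.
attribute_value = root.get("id")     # the root's "id" attribute
missing_attribute = root.get("p")    # there is no attribute named "p"

# find() with a bare tag name only matches direct children.
direct_child = root.find("p")        # <p> is not a direct child of the root
first_sec = root.find("sec")         # the inner <sec> is a direct child

print(attribute_value, missing_attribute, direct_child, first_sec.get("id"))
```

This is why root.get('p').text fails: get() returns None for a missing attribute, and None has no .text.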

If you want to match not only direct children but also deeper descendants (as in your example), the expression passed to root.find / root.findall has to use the supported subset of XPath (see https://docs.python.org/2/library/xml.etree.elementtree.html#xpath-support ). In your case it is './/p':

  text = root.find('.//p')
  print("".join(text.itertext()))
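
One caveat: find() returns None when nothing matches, so calling itertext() on the result would raise an AttributeError. A minimal sketch of a guarded version follows; the helper name first_paragraph_text and the two sample documents are hypothetical, made up for illustration.

```python
import xml.etree.ElementTree as ET


def first_paragraph_text(root):
    """Return the text of the first <p> descendant, or None if there is none."""
    first_p = root.find(".//p")
    if first_p is None:
        return None
    # Join the element's own text with the text of any inline children.
    return "".join(first_p.itertext())


# Hypothetical sample documents, one with a nested <p> and one without.
with_p = ET.fromstring("<sec><sec><p>Some <b>bold</b> text.</p></sec></sec>")
without_p = ET.fromstring("<sec><title>No paragraphs here</title></sec>")

print(first_paragraph_text(with_p))
print(first_paragraph_text(without_p))
```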
