Trouble retrieving text from XML with ElementTree with tags

Right now I have some code which uses Biopython and NCBI's "Entrez" API to get XML strings from PubMed Central. I'm trying to parse the XML with ElementTree to get just the text from the page. Although I have BeautifulSoup code that does exactly this when I scrape the lxml data from the site itself, I'm switching to the NCBI API since scrapers are apparently a no-no. But now, with the XML from the NCBI API, I'm finding ElementTree extremely unintuitive and could really use some help getting it to work. Of course I've looked at other posts, but most of these deal with namespaces, and in my case I just want to use the XML tags to grab information. Even the ElementTree documentation doesn't go into this (from what I can tell). Can anyone help me figure out the syntax to grab information within certain tags rather than within certain namespaces?

Here's an example. Note: I use Python 3.4.

Small snippet of the XML:

      <sec sec-type="materials|methods" id="s5">
      <title>Materials and Methods</title>
      <sec id="s5a">
        <title>Overgo design</title>
        <p>In order to screen the saltwater crocodile genomic BAC library described below, four overgo pairs (forward and reverse) were designed (<xref ref-type="table" rid="pone-0114631-t002">Table 2</xref>) using saltwater crocodile sequences of MHC class I and II from previous studies <xref rid="pone.0114631-Jaratlerdsiri1" ref-type="bibr">[40]</xref>, <xref rid="pone.0114631-Jaratlerdsiri3" ref-type="bibr">[42]</xref>. The overgos were designed using OligoSpawn software, with a GC content of 50&#x2013;60% and 36 bp in length (8-bp overlapping) <xref rid="pone.0114631-Zheng1" ref-type="bibr">[77]</xref>. The specificity of the overgos was checked against vertebrate sequences using the basic local alignment search tool (BLAST; <ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/">http://www.ncbi.nlm.nih.gov/</ext-link>).</p>
    <table-wrap id="pone-0114631-t002" orientation="portrait" position="float">
      <object-id pub-id-type="doi">10.1371/journal.pone.0114631.t002</object-id>
      <label>Table 2</label>
      <caption>
        <title>Four pairs of forward and reverse overgos used for BAC library screening of MHC-associated BACs.</title>
      </caption>
      <alternatives>
        <graphic id="pone-0114631-t002-2" xlink:href="pone.0114631.t002"/>
        <table frame="hsides" rules="groups">
          <colgroup span="1">
            <col align="left" span="1"/>
            <col align="center" span="1"/>
          </colgroup>

For my project, I want all of the text in the "p" tags (not just for this snippet of the XML, but for the entire XML string).

Now, I already know that I can turn the whole XML string into an ElementTree object:

>>> import xml.etree.ElementTree as ET
>>> tree = ET.ElementTree(ET.fromstring(xml_string))
>>> root = ET.fromstring(xml_string)

Now if I try to get the text using the tag like this:

 >>> text = root.find('p')
 >>> print("".join(text.itertext()))

or

 >>> text = root.get('p').text

I can't extract the text that I want. From what I've read, this is because I'm using the tag "p" as an argument rather than a namespace.

While I feel like it should be quite simple to get all the text in "p" tags within an XML file, I'm currently unable to do it. Please let me know what I'm missing and how I can fix this. Thanks!

--- EDIT ---

So now I know that I should be using this code to get everything in the 'p' tags:

>>> text = root.find('.//p')
>>> print("".join(text.itertext()))

Despite the fact that I'm using itertext(), it only returns content from the first "p" tag and doesn't look at any other content. Does itertext() only iterate within a tag? The documentation seems to suggest it iterates across all tags as well, so I'm not sure why it's only returning one line instead of all of the text under all of the "p" tags.
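A minimal sketch of what itertext() covers, assuming a made-up two-paragraph document (the XML string below is illustrative, not taken from the NCBI response). It suggests that itertext() yields the text of the element it is called on plus that element's descendants, and nothing from sibling elements:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, only for illustration.
xml_string = (
    "<sec>"
    "<p>First <xref>[1]</xref> paragraph.</p>"
    "<p>Second paragraph.</p>"
    "</sec>"
)
root = ET.fromstring(xml_string)

# find() returns only the first matching element, so itertext()
# here walks just that one <p> and its children.
first_p = root.find(".//p")
first_text = "".join(first_p.itertext())
print(first_text)

# Called on the root, itertext() walks every descendant,
# so the text of both paragraphs appears.
whole_text = "".join(root.itertext())
print(whole_text)
```

So the single line of output comes from find() returning one element, not from itertext() stopping early.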

--- FINAL EDIT ---

I figured out that itertext() only works within one tag and find() only returns the first item. In order to get the entire text that I want, I must use findall():

>>> all_text = root.findall('.//p')
>>> for texts in all_text:
    print("".join(texts.itertext()))
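The same idea can be wrapped in a small helper. This is only a sketch: the function name paragraph_texts and the sample XML are made up for illustration, and it assumes the "p" elements carry no namespace prefix.

```python
import xml.etree.ElementTree as ET


def paragraph_texts(xml_string):
    """Return the flattened text of every <p> element, in document order."""
    root = ET.fromstring(xml_string)
    # findall('.//p') matches <p> elements at any depth below the root;
    # itertext() then flattens nested tags such as <xref> inside each one.
    return ["".join(p.itertext()) for p in root.findall(".//p")]


# Hypothetical sample document, only for illustration.
sample = (
    "<body>"
    "<sec><title>Overgo design</title>"
    "<p>Four overgo pairs were designed (<xref>Table 2</xref>).</p></sec>"
    "<sec><p>Second <italic>nested</italic> paragraph.</p></sec>"
    "</body>"
)

texts = paragraph_texts(sample)
full_text = "\n".join(texts)
print(full_text)
```

Returning a list instead of printing keeps the paragraphs separate, so they can be joined or filtered later as needed.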

root.get() is the wrong method, as it retrieves an attribute of the root tag, not a subtag. root.find() is correct, as it finds the first matching subtag (alternatively, one can use root.findall() for all matching subtags).

If you want to find not only direct subtags but also indirect subtags (as in your example), the expression passed to root.find / root.findall has to be written in the subset of XPath that ElementTree supports (see https://docs.python.org/2/library/xml.etree.elementtree.html#xpath-support). In your case it is './/p':

  text = root.find('.//p')
  print("".join(text.itertext()))
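To make the difference between the three calls concrete, here is a small sketch on a made-up document (the XML and variable names are assumptions for illustration only):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, only for illustration.
root = ET.fromstring(
    '<article id="a1">'
    "<sec><p>Nested paragraph.</p></sec>"
    "</article>"
)

attr = root.get("p")        # looks up an attribute named "p" on <article>
direct = root.find("p")     # searches direct children of <article> only
nested = root.find(".//p")  # searches all descendants of <article>

print(attr, direct, nested.text)
```

Both get('p') and find('p') return None here, which is why calling .text or .itertext() on their result raises an AttributeError; only the './/p' form reaches the nested element.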
