String breaks off at square brackets when parsing with lxml

Question

I'm new to parsing with lxml and am stuck on a simple parsing problem. One line in my XML looks like this:

The IgM BCR is essential for survival of peripheral B cells [<xref ref-type="bibr" rid="CR34">34</xref>]. In the absence of BTK B cell...

So, when I run the following code:

e = open('somexml.xml', encoding='utf8')

tree = etree.parse(e)

titles = tree.xpath('/pmc-articleset/article/front/article-meta/title-group/article-title')

for node in titles:
    text = tree.xpath('/pmc-articleset/article/body/sec/p')

    for node in text:
        content = str(node.text).encode("utf-8")
        s = str(' '.join(lxml.html.fromstring(content).xpath("//text()")).encode('latin1'))
        print (s)

The result is:

The IgM BCR is essential for survival of peripheral B cells ['

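Even if I print only node.text, without any join, the result looks much the same.

How can I get past the square-bracket part and receive the complete string? Any help would be greatly appreciated!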

Answer 1

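"]. In the absence of BTK B cell..." is the value of the tail property of the <xref> element. See http://infohost.nmt.edu/tcc/help/pubs/pylxml/web/etree-view.html.

There is nothing special about square brackets; they are just characters. node.text holds only the text up to the first child element, which is why the string appears to stop at "[".

With itertext() you can get the text content of an element and all of its descendants. The tail content is included by default. See http://lxml.de/api/lxml.etree._Element-class.html#itertext.

A small demo: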

from lxml import etree

xml = "<p>TEXT <xref>34</xref>TAIL</p>"
p = etree.fromstring(xml)

print(p.text)
print(''.join(p.itertext()))
print(p.text + p.find("xref").tail)

Output:

TEXT 
TEXT 34TAIL
TEXT TAIL
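
Applied to the sentence from the question, the same idea can be sketched as follows. The <p> wrapper around the sample line and the helper name paragraph_text are assumptions made for illustration; they do not appear in the original post.

```python
from lxml import etree

# Sample paragraph built from the line quoted in the question.
# The enclosing <p> element is an assumption about the real file.
xml = (
    '<p>The IgM BCR is essential for survival of peripheral B cells '
    '[<xref ref-type="bibr" rid="CR34">34</xref>]. In the absence of BTK B cell...</p>'
)
p = etree.fromstring(xml)


def paragraph_text(elem):
    """Hypothetical helper: join every text and tail string in elem's subtree."""
    return ''.join(elem.itertext())


print(repr(p.text))               # text before the first child element, ends with '['
print(repr(p.find('xref').tail))  # text that follows </xref>
print(paragraph_text(p))          # the complete sentence, citation number included
```

If the citation number itself is unwanted, concatenating p.text and the tail of the <xref> element, as in the last line of the demo above, leaves it out.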

Answer 2

Try something along these lines:

from lxml import etree

e = open('somexml.xml', encoding='utf8')

tree = etree.parse(e)

titles = tree.xpath('/pmc-articleset/article/front/article-meta/title-group/article-title')

for title in titles:
    ps = title.xpath('/pmc-articleset/article/body/sec/p')

    for p in ps:
        text = ''.join(p.itertext())
        print(text)
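
Because the XPath for the paragraphs is absolute, it selects the same nodes no matter which title it is evaluated from, so the outer loop over titles may not be needed. A self-contained sketch, assuming a tiny inline document in place of somexml.xml (the sample content is invented; only the element paths follow the question):

```python
from lxml import etree

# Invented sample standing in for somexml.xml; the element paths match the question.
xml = b"""<pmc-articleset><article>
<front><article-meta><title-group><article-title>T</article-title></title-group></article-meta></front>
<body><sec>
<p>First [<xref ref-type="bibr" rid="CR1">1</xref>]. More text.</p>
<p>Second paragraph.</p>
</sec></body>
</article></pmc-articleset>"""

root = etree.fromstring(xml)

# itertext() yields each element's text plus the tails of its descendants,
# so text after an inline <xref> is kept.
paragraphs = [
    ''.join(p.itertext())
    for p in root.xpath('/pmc-articleset/article/body/sec/p')
]

for text in paragraphs:
    print(text)
```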

2 answers

Answer 1: 3, 2018-04-19 14:48:08

Answer 2: 0, accepted, 2018-04-19 13:50:06