Cannot extract text from XML in Python

Question

I have an XML file that comes from a doc (MS Word 2003, so I can't use the docx library). I'm using lxml to parse it. I can get most of the text (everything is in <txt> nodes), but some nodes have the following structure:

<txt ptr="0x7f6354043000" id="3" symbol="8SwTxtFrm" next="4" upper="2" txtNodeIndex="9">
 <infos>
  <bounds left="1521" top="851" width="10517" height="322"/>
 </infos>
 The text I want to extract    <Special nLength="0" nType="POR_MARGIN" rText="" nWidth="2396"/>
 <Text nLength="1" nType="POR_TXT" nHeight="322" nWidth="78"/>
 <Text nLength="42" nType="POR_TXT" nHeight="322" nWidth="5647"/>
 <Special nLength="0" nType="POR_MARGIN" rText="" nWidth="2397"/>
 <LineBreak nWidth="10518"/>
 <Finish/>
</txt>

When I iterate over the <txt> nodes to extract the text part with:

for txt in tree.iter('txt'):
    print(txt.text)

I realized that it's the <infos> node that causes the problem. I tried to remove it:

for elt in tree.iter('txt'):
    for info in elt.findall('infos'):
        elt.remove(info)

But this removes the targeted text along with the <infos> node, even though the text is outside it.

Can someone help me understand why?

Answer 1

As per my comment on the original post, the OP solved the problem by changing the XPath as follows:

tree.xpath('//text()')
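
A minimal sketch of how that XPath might be used, assuming lxml is installed. The sample document is a shortened, hypothetical stand-in for the question's file, and the names `all_text` and `per_node` are only illustrative. Note that `//text()` is an absolute path and collects every text node in the document, while `.//text()` stays inside the element it is called on:

```python
from lxml import etree

# Shortened, hypothetical sample modelled on the question's <txt> nodes.
xml = (
    '<root>'
    '<txt id="3"><infos><bounds left="1521"/></infos>'
    'The text I want to extract<Special nWidth="2396"/><Finish/></txt>'
    '<txt id="4"><infos/>Another line<Finish/></txt>'
    '</root>'
)
tree = etree.fromstring(xml)

# '//text()' is absolute: it returns a list of every text node in the document.
all_text = tree.xpath('//text()')
print(all_text)

# './/text()' is relative: only the text nodes below the current <txt> element.
per_node = {}
for txt in tree.iter('txt'):
    pieces = txt.xpath('.//text()')
    per_node[txt.get('id')] = ''.join(pieces).strip()
print(per_node)
```

Either way the result is a list of strings, so it still has to be joined to get a single string per node.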

Answer 2

You can extract the text this way:

In [31]: txt = """<txt ptr="0x7f6354043000" id="3" symbol="8SwTxtFrm" next="4" upper="2" txtNodeIndex="9">
   ....:  <infos>
   ....:   <bounds left="1521" top="851" width="10517" height="322"/>
   ....:  </infos>
   ....:  The text I want to extract    <Special nLength="0" nType="POR_MARGIN" rText="" nWidth="2396"/>
   ....:  <Text nLength="1" nType="POR_TXT" nHeight="322" nWidth="78"/>
   ....:  <Text nLength="42" nType="POR_TXT" nHeight="322" nWidth="5647"/>
   ....:  <Special nLength="0" nType="POR_MARGIN" rText="" nWidth="2397"/>
   ....:  <LineBreak nWidth="10518"/>
   ....:  <Finish/>
   ....: </txt>"""

In [32]: node = etree.fromstring(txt)

In [33]: ''.join(node.itertext())
Out[33]: '\n \n  \n \n The text I want to extract    \n \n \n \n \n \n'
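
The joined string above still carries the indentation and line breaks between the tags. One possible way to tidy it up is sketched below, on a shortened, hypothetical copy of the sample. The fallback import is an assumption for environments without lxml; the standard library's ElementTree also provides `itertext()`:

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which also has itertext().
    import xml.etree.ElementTree as etree

# Shortened, hypothetical copy of the <txt> node from the question.
sample = """<txt id="3">
 <infos>
  <bounds left="1521" top="851"/>
 </infos>
 The text I want to extract    <Special nLength="0"/>
 <Finish/>
</txt>"""
node = etree.fromstring(sample)

# Drop whitespace-only fragments and trim the remaining ones.
fragments = [piece.strip() for piece in node.itertext() if piece.strip()]
text = ' '.join(fragments)
print(text)
```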

UPD:

The answer suggested by Murali actually returns a list, so you still need to join the strings. My solution is also a little bit faster:

In [13]: %timeit ''.join(node.itertext())
100000 loops, best of 3: 11.7 µs per loop

In [14]: %timeit ''.join(node.xpath('//text()'))
10000 loops, best of 3: 26.3 µs per loop
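
As for why the attempts in the question fail: in lxml (and in the standard library's ElementTree) text that follows a child's closing tag is stored in that child's `.tail`, not in the parent's `.text`. So `txt.text` holds only the whitespace before <infos>, the wanted string is the tail of <infos>, and `remove()` drops the element together with its tail. A minimal sketch on a shortened, hypothetical sample; `remove_keep_tail` is an illustrative helper, not part of lxml:

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which shares the text/tail model.
    import xml.etree.ElementTree as etree

# Shortened, hypothetical copy of the <txt> node from the question.
sample = (
    '<txt id="3">\n'
    ' <infos><bounds left="1521"/></infos>\n'
    ' The text I want to extract    <Special nLength="0"/>\n'
    '</txt>'
)
node = etree.fromstring(sample)
infos = node.find('infos')

text_before = node.text    # only the whitespace before <infos>
tail_before = infos.tail   # the wanted text is attached to <infos> as its tail
print(repr(text_before))
print(repr(tail_before))


def remove_keep_tail(parent, child):
    """Hypothetical helper: remove child but keep its tail text in the tree."""
    tail = child.tail or ''
    children = list(parent)
    index = children.index(child)
    if index == 0:
        # No previous sibling: the tail is appended to the parent's text.
        parent.text = (parent.text or '') + tail
    else:
        # Otherwise it is appended to the previous sibling's tail.
        previous = children[index - 1]
        previous.tail = (previous.tail or '') + tail
    parent.remove(child)


remove_keep_tail(node, infos)
print(repr(node.text))
```

With the tail preserved this way, the <infos> node is gone but the text survives in the parent's `.text`.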


Answer 1
1 2015-03-18 08:53:09

Answer 2
0 Accepted 2015-03-12 10:58:49
