Cannot extract text from XML in Python
I have an XML file that comes from a doc (MS Word 2003, so I can't use the docx library). I'm using lxml to parse it. I can get most of the text (everything is in <txt> nodes), but some nodes have the following structure:
<txt ptr="0x7f6354043000" id="3" symbol="8SwTxtFrm" next="4" upper="2" txtNodeIndex="9">
<infos>
<bounds left="1521" top="851" width="10517" height="322"/>
</infos>
The text I want to extract <Special nLength="0" nType="POR_MARGIN" rText="" nWidth="2396"/>
<Text nLength="1" nType="POR_TXT" nHeight="322" nWidth="78"/>
<Text nLength="42" nType="POR_TXT" nHeight="322" nWidth="5647"/>
<Special nLength="0" nType="POR_MARGIN" rText="" nWidth="2397"/>
<LineBreak nWidth="10518"/>
<Finish/>
</txt>
When I iterate over the <txt> nodes to extract the text with:
for txt in tree.iter('txt'):
    print(txt.text)
I realized that the <infos> node is what causes the problem. I tried to remove it:
for elt in tree.iter('txt'):
    for info in elt.findall('infos'):
        elt.remove(info)
But this removes the targeted text along with the <infos> node, even though the text is outside it. Can someone help me understand why?
As per my comment on the original post, the OP solved the problem by changing the XPath as follows:
tree.xpath('//text()')
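For context on why txt.text came back empty and why removing <infos> took the text with it: in the ElementTree model that lxml follows, text that comes after a child's closing tag is stored as that child's tail, not as the parent's text. Removing the child removes its tail too. Below is a minimal sketch on a trimmed-down copy of the node from the question. The helper name remove_keep_tail is hypothetical (not part of lxml), and the import falls back to the standard library, which uses the same text/tail model, if lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # the standard library shares the text/tail model
    import xml.etree.ElementTree as etree

# Trimmed-down version of the <txt> node from the question.
xml = (
    '<txt id="3">'
    '<infos><bounds left="1521"/></infos>'
    'The text I want to extract '
    '<Special nLength="0"/>'
    '</txt>'
)

txt = etree.fromstring(xml)
infos = txt.find('infos')

# Nothing precedes <infos> inside <txt>, so txt.text is empty here
# (in the original file it would hold only whitespace).
print(repr(txt.text))
# The wanted string is stored as the tail of <infos>.
print(repr(infos.tail))


def remove_keep_tail(parent, child):
    """Hypothetical helper: drop `child` but keep its tail text in the tree."""
    siblings = list(parent)
    index = siblings.index(child)
    tail = child.tail or ''
    if index > 0:
        # Append the tail to the previous sibling's tail.
        prev = siblings[index - 1]
        prev.tail = (prev.tail or '') + tail
    else:
        # No previous sibling: the tail becomes part of the parent's text.
        parent.text = (parent.text or '') + tail
    parent.remove(child)


remove_keep_tail(txt, infos)
print(repr(txt.text))
```

After the call, the string that used to be the tail of <infos> is held in txt.text, so the loop from the question would print it.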
You can extract the text this way:
In [31]: txt = """<txt ptr="0x7f6354043000" id="3" symbol="8SwTxtFrm" next="4" upper="2" txtNodeIndex="9">
....: <infos>
....: <bounds left="1521" top="851" width="10517" height="322"/>
....: </infos>
....: The text I want to extract <Special nLength="0" nType="POR_MARGIN" rText="" nWidth="2396"/>
....: <Text nLength="1" nType="POR_TXT" nHeight="322" nWidth="78"/>
....: <Text nLength="42" nType="POR_TXT" nHeight="322" nWidth="5647"/>
....: <Special nLength="0" nType="POR_MARGIN" rText="" nWidth="2397"/>
....: <LineBreak nWidth="10518"/>
....: <Finish/>
....: </txt>"""
In [32]: node = etree.fromstring(txt)
In [33]: ''.join(node.itertext())
Out[33]: '\n \n \n \n The text I want to extract \n \n \n \n \n \n'
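The joined string above still carries the indentation and newlines between the child elements. If only the visible text is wanted, one option is to collapse the whitespace afterwards. This is a small sketch on a shortened copy of the node (some attributes dropped for brevity), not part of the original answer; the import falls back to the standard library if lxml is not available.

```python
try:
    from lxml import etree
except ImportError:  # itertext() behaves the same in the standard library
    import xml.etree.ElementTree as etree

xml = """<txt id="3">
  <infos>
    <bounds left="1521" top="851" width="10517" height="322"/>
  </infos>
  The text I want to extract <Special nLength="0" nType="POR_MARGIN"/>
  <Text nLength="1" nType="POR_TXT"/>
  <LineBreak nWidth="10518"/>
  <Finish/>
</txt>"""

node = etree.fromstring(xml)

# itertext() yields every text and tail string below the node, in document order.
raw = ''.join(node.itertext())

# str.split() with no arguments splits on any run of whitespace,
# so joining with a single space normalizes the result.
clean = ' '.join(raw.split())
print(clean)  # The text I want to extract
```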
UPD: The answer suggested by Murali actually returns a list, so you still need to join the strings. My solution is also a little faster:
In [13]: %timeit ''.join(node.itertext())
100000 loops, best of 3: 11.7 µs per loop
In [14]: %timeit ''.join(node.xpath('//text()'))
10000 loops, best of 3: 26.3 µs per loop
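One caveat when comparing the two calls: an XPath that starts with // is evaluated from the root of the document, not from the element it is called on. In the timing above the node is a standalone document, so both expressions return the same strings, but on a <txt> node inside a larger tree the relative form .//text() is the one that matches itertext(). A minimal sketch with a made-up two-node document (the <root> wrapper and the text values are assumptions for illustration):

```python
from lxml import etree

# Hypothetical document with two <txt> nodes under one root.
root = etree.fromstring(
    '<root>'
    '<txt id="1"><infos/>first</txt>'
    '<txt id="2"><infos/>second</txt>'
    '</root>'
)
first = root.find('txt')

# Absolute path: searches the whole document, even when called on `first`.
absolute = first.xpath('//text()')
# Relative path: searches only below `first`.
relative = first.xpath('.//text()')

print(absolute)  # ['first', 'second']
print(relative)  # ['first']
```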