Extracting text from an XML document in Python

Question

This is the sample XML document:

<bookstore>
    <book category="COOKING">
        <title lang="english">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>300.00</price>
    </book>

    <book category="CHILDREN">
        <title lang="english">Harry Potter</title>
        <author>J K. Rowling </author>
        <year>2005</year>
        <price>625.00</price>
    </book>
</bookstore>

I want to extract the text without specifying the elements. How can I do this? I have 10 such documents. The reason is that the user enters a word that I don't know in advance, and it has to be searched for in the text portions of all 10 XML documents. For that to work, I need to find where the text is without knowing the element names. One more thing: all of these documents are different from one another.

Please help!

Answer 1

This is possible using the lxml library with an XPath query:

xml="""<bookstore>
    <book category="COOKING">
        <title lang="english">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>300.00</price>
    </book>

    <book category="CHILDREN">
        <title lang="english">Harry Potter</title>
        <author>J K. Rowling </author>
        <year>2005</year>
        <price>625.00</price>
    </book>
</bookstore>
"""
from lxml import etree
root = etree.fromstring(xml)  # fromstring() already returns the root element
root.xpath('/bookstore/book/*/text()')
# ['Everyday Italian', 'Giada De Laurentiis', '2005', '300.00', 'Harry Potter', 'J K. Rowling ', '2005', '625.00']

Note that this does not return the category attribute.
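Since the question asks for text without naming any elements, the whole tree can also be walked generically. Below is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of lxml (a substitution, not the method shown above). The helper names extract_text and extract_attribute_values are made up for illustration, and the sample document is shortened.

```python
import xml.etree.ElementTree as ET

def extract_text(xml_string):
    """Return every non-blank text chunk in the document, in document order."""
    root = ET.fromstring(xml_string)
    # itertext() yields the text of every element without needing tag names.
    return [chunk.strip() for chunk in root.itertext() if chunk.strip()]

def extract_attribute_values(xml_string):
    """Return all attribute values (e.g. the category), in document order."""
    root = ET.fromstring(xml_string)
    return [value for elem in root.iter() for value in elem.attrib.values()]

sample = """<bookstore>
    <book category="COOKING">
        <title lang="english">Everyday Italian</title>
        <price>300.00</price>
    </book>
</bookstore>"""

texts = extract_text(sample)
attrs = extract_attribute_values(sample)
print(texts)
print(attrs)
```

Because neither helper refers to bookstore, book, or title, the same functions should work on documents with a different structure.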

Answer 2

You could simply strip out all the tags:

>>> import re
>>> txt = """<bookstore>
...     <book category="COOKING">
...         <title lang="english">Everyday Italian</title>
...         <author>Giada De Laurentiis</author>
...         <year>2005</year>
...         <price>300.00</price>
...     </book>
...
...     <book category="CHILDREN">
...         <title lang="english">Harry Potter</title>
...         <author>J K. Rowling </author>
...         <year>2005</year>
...         <price>625.00</price>
...     </book>
... </bookstore>"""
>>> exp = re.compile(r'<.*?>')
>>> text_only = exp.sub('',txt).strip()
>>> text_only
'Everyday Italian\n        Giada De Laurentiis\n        2005\n        300.00\n    \n\n    \n        Harry Potter\n        J K. Rowling \n        2005\n        625.00'
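The result above still contains the indentation and newlines from the original markup. A small sketch of one way to tidy that up, assuming a hypothetical helper named strip_tags: each tag is replaced with a space (so adjacent elements do not run together) and whitespace runs are then collapsed.

```python
import re

TAG = re.compile(r'<.*?>')

def strip_tags(markup):
    """Remove anything that looks like a tag, then collapse whitespace runs."""
    without_tags = TAG.sub(' ', markup)
    # split() with no arguments splits on any whitespace, including newlines.
    return ' '.join(without_tags.split())

snippet = "<book>\n    <title>Harry Potter</title>\n    <year>2005</year>\n</book>"
cleaned = strip_tags(snippet)
print(cleaned)
```

Keep in mind that a regular expression does not decode entities such as &amp;amp; and does not understand comments or CDATA sections, so this is only a rough approach.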

But if you just want to search files for some text on Linux, you can use grep:

burhan@sandbox:~$ grep "Harry Potter" file.xml
        <title lang="english">Harry Potter</title>

If you want to search in a file, use the grep command above, or open the file and search for it in Python:

>>> import re
>>> exp = re.compile(r'<.*?>')
>>> with open('file.xml') as f:
...     text_only = exp.sub('', f.read()).strip()
...
>>> if 'Harry Potter' in text_only:
...     print('It exists')
... else:
...     print('It does not')
...
It exists

Answer 3

If you want to call grep from inside Python, see the discussion here, especially this post.

If you want to search through all the files in a directory, you could try something like this using the glob module:

import glob
import re

# Captures the text between a '>' and the last '<' on the same line.
p = re.compile(r'>(.*)<')
for filename in glob.glob("*.xml"):
    with open(filename) as f:
        content = f.read()
    # findall() returns only the captured group, without the '>' and '<'.
    print(p.findall(content))
    print()

This iterates through all the XML files in the current directory, opens each one, and extracts the text matching the regexp.

Output:

['Everyday Italian', 'Giada De Laurentiis', '2005', '300.00', 'Harry Potter', 'J K. Rowling ', '2005', '625.00']

EDIT: Updated the code to extract only the text elements from the XML.
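To tie this back to the original question (a user-supplied word searched across several different XML files), here is a hedged sketch that combines glob with the standard library's XML parser rather than a regexp. The function name files_containing and the two demo files are assumptions made for illustration only.

```python
import glob
import os
import tempfile
import xml.etree.ElementTree as ET

def files_containing(word, directory):
    """Return names of *.xml files in `directory` whose text contains `word` (case-insensitive)."""
    matches = []
    for path in sorted(glob.glob(os.path.join(directory, "*.xml"))):
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        # Only element text is searched; tag names and attributes are ignored.
        text = " ".join(root.itertext()).lower()
        if word.lower() in text:
            matches.append(os.path.basename(path))
    return matches

# Demo with two throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "books.xml"), "w", encoding="utf-8") as f:
        f.write("<bookstore><book><title>Harry Potter</title></book></bookstore>")
    with open(os.path.join(tmp, "films.xml"), "w", encoding="utf-8") as f:
        f.write("<films><film><name>Roman Holiday</name></film></films>")
    found = files_containing("harry", tmp)
    not_found = files_containing("film", tmp)  # 'film' appears only in tag names
    print(found, not_found)
```

Parsing the files means a search term that happens to match a tag name is not reported, which the plain-text approaches above cannot guarantee.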


3 answers

Answer 1
1 2012-07-01 04:57:18

Answer 2
0 Accepted 2012-07-01 04:36:32

Answer 3
0 2012-07-01 04:56:48
