
Extract text from XML documents in Python

This is the sample XML document:

<bookstore>
    <book category="COOKING">
        <title lang="english">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>300.00</price>
    </book>

    <book category="CHILDREN">
        <title lang="english">Harry Potter</title>
        <author>J K. Rowling </author>
        <year>2005</year>
        <price>625.00</price>
    </book>
</bookstore>

I want to extract the text without specifying the elements. How can I do this? I have 10 such documents. The user enters a word that I don't know in advance, and it has to be searched for in the text portions of all 10 XML documents. For this to work, I need to find where the text lies without knowing the element names. One more thing: all of these documents are different from one another.

Please help!

It is possible to use the lxml library with an XPath query:

xml="""<bookstore>
    <book category="COOKING">
        <title lang="english">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>300.00</price>
    </book>

    <book category="CHILDREN">
        <title lang="english">Harry Potter</title>
        <author>J K. Rowling </author>
        <year>2005</year>
        <price>625.00</price>
    </book>
</bookstore>
"""
from lxml import etree
# fromstring() already returns the root element, so there is no getroot() to call
root = etree.fromstring(xml)
root.xpath('/bookstore/book/*/text()')
# ['Everyday Italian', 'Giada De Laurentiis', '2005', '300.00', 'Harry Potter', 'J K. Rowling ', '2005', '625.00']

Although you don't get the category attribute this way.
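
If the element names are not known in advance, one option is to walk the whole tree and collect every text node along with the attribute values. The following is only a sketch: it uses the standard library's xml.etree.ElementTree instead of lxml, and the helper name all_text and the shortened sample are made up for illustration.

```python
import xml.etree.ElementTree as ET

def all_text(xml_string):
    """Return (texts, attribute_values) for a document (hypothetical helper)."""
    root = ET.fromstring(xml_string)
    # itertext() yields every text node in document order; drop whitespace-only ones
    texts = [t.strip() for t in root.itertext() if t.strip()]
    # iter() visits every element, so attribute values such as the category are included
    attrs = [value for elem in root.iter() for value in elem.attrib.values()]
    return texts, attrs

sample = """<bookstore>
    <book category="COOKING">
        <title lang="english">Everyday Italian</title>
        <year>2005</year>
    </book>
</bookstore>"""

print(all_text(sample))
```

Nothing here names bookstore, book or title, so the same function should work on documents with a different structure.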

You could simply strip out any tags:

>>> import re
>>> txt = """<bookstore>
...     <book category="COOKING">
...         <title lang="english">Everyday Italian</title>
...         <author>Giada De Laurentiis</author>
...         <year>2005</year>
...         <price>300.00</price>
...     </book>
...
...     <book category="CHILDREN">
...         <title lang="english">Harry Potter</title>
...         <author>J K. Rowling </author>
...         <year>2005</year>
...         <price>625.00</price>
...     </book>
... </bookstore>"""
>>> exp = re.compile(r'<.*?>')
>>> text_only = exp.sub('',txt).strip()
>>> text_only
'Everyday Italian\n        Giada De Laurentiis\n        2005\n        300.00\n    \n\n    \n        Harry Potter\n        J K. Rowling \n        2005\n        625.00'
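
The stripped result still carries the indentation of the original markup, and the pattern leaves entities such as &amp; untouched. A small sketch of tidying that up follows; the function name strip_tags and the sample string are invented for illustration, and comments or CDATA sections are still not handled by this kind of pattern.

```python
import html
import re

TAG = re.compile(r'<.*?>')

def strip_tags(markup):
    """Remove tags, decode entities, and collapse runs of whitespace (hypothetical helper)."""
    no_tags = TAG.sub(' ', markup)      # replace each tag with a space so words stay separated
    decoded = html.unescape(no_tags)    # turns &amp; into &, &lt; into <, and so on
    return ' '.join(decoded.split())    # collapse newlines and indentation into single spaces

sample = "<book>\n  <title>Tom &amp; Jerry</title>\n  <year>2005</year>\n</book>"
print(strip_tags(sample))
```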

But if you just want to search files for some text in Linux, you can use grep:

burhan@sandbox:~$ grep "Harry Potter" file.xml
        <title lang="english">Harry Potter</title>

If you want to search in a file, use the grep command above, or open the file and search for it in Python:

>>> import re
>>> exp = re.compile(r'<.*?>')
>>> with open('file.xml') as f:
...     text_only = exp.sub('', f.read()).strip()
...
>>> if 'Harry Potter' in text_only:
...     print('It exists')
... else:
...     print('It does not')
...
It exists

If you want to call grep from inside Python, see the discussion here, especially this post.

If you want to search through all the files in a directory, you could try something like this using the glob module:

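import glob
import re

# Greedy match from the first '>' to the last '<' on a line
p = re.compile('>.*<')
for filename in glob.glob("*.xml"):
    # Read the whole file; the with block closes it afterwards
    with open(filename) as f:
        content = f.read()
    # Trim the surrounding '>' and '<' from each match
    texts = [m.lstrip('>').rstrip('<') for m in p.findall(content)]
    print(texts)
    print()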

This iterates through all the .xml files in the current directory, opens each file, and extracts the text matching the regexp.

Output:

['Everyday Italian', 'Giada De Laurentiis', '2005', '300.00', 'Harry Potter', 'J K. Rowling ', '2005', '625.00']

EDIT: Updated the code to extract only the text elements from the XML.
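
To tie this back to the question (a user-supplied word looked up across several differently structured documents), the directory loop can be combined with an XML parser instead of a regexp. This is a rough sketch, not a definitive solution: the helper name find_word, the case-insensitive match, and the choice to skip files that fail to parse are all assumptions.

```python
import glob
import os
import xml.etree.ElementTree as ET

def find_word(word, directory="."):
    """Return names of .xml files in `directory` whose text contains `word` (hypothetical helper)."""
    hits = []
    for path in sorted(glob.glob(os.path.join(directory, "*.xml"))):
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError:
            continue  # assumption: silently skip files that are not well-formed XML
        # Join every text node; no element names are needed
        text = " ".join(root.itertext())
        if word.lower() in text.lower():  # assumption: case-insensitive match
            hits.append(os.path.basename(path))
    return hits
```

Calling find_word("Harry Potter") would then list the files in the current directory that mention it somewhere in their text. Attribute values are not searched in this sketch.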
