Using lxml, how can I read text inside nested elements?

I'm trying to search about 500 XML documents for some specific phrases and output the ID of any element that contains any of those phrases. Currently, this is my code:

from lxml import etree
import os
import re

files = os.listdir('C:/Users/Me/Desktop/xml')
search_words = ['House divided', 'Committee divided', 'on Division', 'Division List',
                'The Ayes and the Noes',]

for f in files:
    doc = etree.parse('C:/Users/Me/Desktop/xml/' +f)
    for elem in doc.iter():
        for word in search_words:
            if elem.text is not None and str(elem.attrib) != "{}" and word in elem.text and len(re.findall(r'\d+', elem.text))>1:
                votes = re.findall(r'\d+', elem.text)
                string = str(elem.attrib)[8:-2] + ","
                string += (str(votes[0]) + "," + str(votes[1]) + ",")
                string += word + ","
                string += str(elem.sourceline)
                print(string)

Input like this produces the expected output:

<p id="S3V0001P0-01869">The House divided; Against the Motion 83; For it 23&#x2014;Majority 60.</p>

But input with nested elements like this is missed, because the text inside the nested elements is not being searched for the phrases:

<p id="S3V0141P0-01248"><member>THE CHANCELLOR OF THE EXCHEQUER</member><membercontribution> said, that the precedent occurred on the 8th of April, 1850, on a Motion for going into a Committee of Supply. An Amendment was moved by Captain Boldero on the subject of assistant-surgeons in the navy, when, on a division being called for, the Question was put that the words proposed to be left out stand part of the Question. The House divided, when the numbers were&#x2014;Ayes, 40; Noes, 48. The Question, "That the proposed words be added" was put and agreed to; the main Question, as amended, was put and agreed to; and the Question being then put, "That Mr. Speaker do now leave the chair," that Motion was agreed to, and the House went into Committee of Supply.</membercontribution></p>

Is there any way to read the text inside nested elements like this and return the ID of the enclosing element?

lxml elements have an xpath method, and XPath has a contains function that you can use, e.g.:

from lxml import etree as ET

doc = ET.fromstring('<p id="S3V0141P0-01248"><member>THE CHANCELLOR OF THE EXCHEQUER</member><membercontribution> said, that the precedent occurred on the 8th of April, 1850, on a Motion for going into a Committee of Supply. An Amendment was moved by Captain Boldero on the subject of assistant-surgeons in the navy, when, on a division being called for, the Question was put that the words proposed to be left out stand part of the Question. The House divided, when the numbers were&#x2014;Ayes, 40; Noes, 48. The Question, "That the proposed words be added" was put and agreed to; the main Question, as amended, was put and agreed to; and the Question being then put, "That Mr. Speaker do now leave the chair," that Motion was agreed to, and the House went into Committee of Supply.</membercontribution></p>')
# contains(., $word) tests the element's full string value, including text in descendants
result = doc.xpath('//*[@id and contains(., $word)]', word = 'House divided')
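
To connect this back to the loop in the question, here is a minimal sketch that runs the same XPath once per phrase and pulls the first two numbers out of each match. The function name find_divisions and the small sample document are made up for illustration, and, like the question's code, the sketch assumes the first two numbers in the text are the vote counts.

```python
import re
from lxml import etree

SEARCH_WORDS = ['House divided', 'Committee divided', 'on Division',
                'Division List', 'The Ayes and the Noes']


def find_divisions(root, search_words=SEARCH_WORDS):
    """Return (id, first_number, second_number, phrase) for matching elements.

    Hypothetical helper: it matches any element that has an id attribute and
    whose full string value (own text plus descendant text) contains a phrase.
    """
    rows = []
    for word in search_words:
        for elem in root.xpath('//*[@id and contains(., $word)]', word=word):
            text = elem.xpath('string(.)')      # text of elem and all descendants
            votes = re.findall(r'\d+', text)
            if len(votes) > 1:                  # same filter as in the question
                rows.append((elem.get('id'), votes[0], votes[1], word))
    return rows


# Made-up sample input: one flat paragraph, one nested, one without a phrase.
sample = etree.fromstring(
    '<debate>'
    '<p id="a1">The House divided; Against the Motion 83; For it 23.</p>'
    '<p id="a2"><member>A MEMBER</member><membercontribution> said the '
    'House divided, Ayes, 40; Noes, 48.</membercontribution></p>'
    '<p id="a3">No division here.</p>'
    '</debate>')

for row in find_divisions(sample):
    print(row)
```

Note that with the real paragraph above, the first two numbers in the full text are 8 and 1850 (from the date), so the number extraction may need to be anchored to the phrase rather than to the start of the text.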

You could use some XPath and extract all the text nodes below whatever's interesting. I like Parsel (pip install parsel).

import parsel

data = ('<x><y><z><p id="S3V0141P0-01248"><member>THE CHANCELLOR OF THE EXCHEQUER'
        '</member><membercontribution> said, that the precedent occurred on the '
        '8th of April, 1850, on a Motion ...</membercontribution></p></z></y></x>')

selector = parsel.Selector(data)

for para in selector.xpath('//p'):
    para_id = para.xpath('@id').extract_first()  # avoid shadowing the built-in id()
    texts = para.xpath('*/text()').extract()     # text nodes of the direct children
    for text in texts:
        # do whatever search
        print(para_id, len(text), 'April' in text)

Output:

S3V0141P0-01248 31 False
S3V0141P0-01248 77 True
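
If you would rather search one string per paragraph than one string per child element, a possible variation is to join every text node under the paragraph with itertext(). The sketch below is an assumption-laden illustration, not part of the answer above: full_text is a made-up helper name, and the import falls back to the standard library's ElementTree, which also provides itertext(), in case lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib ElementTree is close enough for this sketch.
    import xml.etree.ElementTree as etree

data = ('<x><y><z><p id="S3V0141P0-01248"><member>THE CHANCELLOR OF THE EXCHEQUER'
        '</member><membercontribution> said, that the precedent occurred on the '
        '8th of April, 1850, on a Motion ...</membercontribution></p></z></y></x>')


def full_text(elem):
    """Concatenate the text of elem and of all its descendants, in document order."""
    return ''.join(elem.itertext())


root = etree.fromstring(data)
for para in root.iter('p'):
    text = full_text(para)
    print(para.get('id'), len(text), 'April' in text)
```

For the sample above this gives a single 108-character string per paragraph (31 + 77), so a phrase that spans the boundary between two child elements can still be found.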
