
Using lxml, how can I read text inside nested elements?

I am trying to search about 500 XML documents for certain phrases and output the ID of every element that contains any of them. This is my code so far:

from lxml import etree
import os
import re

files = os.listdir('C:/Users/Me/Desktop/xml')
search_words = ['House divided', 'Committee divided', 'on Division', 'Division List',
                'The Ayes and the Noes',]

for f in files:
    doc = etree.parse('C:/Users/Me/Desktop/xml/' +f)
    for elem in doc.iter():
        for word in search_words:
            if elem.text is not None and str(elem.attrib) != "{}" and word in elem.text and len(re.findall(r'\d+', elem.text))>1:
                votes = re.findall(r'\d+', elem.text)
                string = str(elem.attrib)[8:-2] + ","
                string += (str(votes[0]) + "," + str(votes[1]) + ",")
                string += word + ","
                string += str(elem.sourceline)
                print(string)

Input like this is output correctly:

<p id="S3V0001P0-01869">The House divided; Against the Motion 83; For it 23&#x2014;Majority 60.</p>

But input with nested elements like the following is missed, because the inner text is never searched for the phrases:

<p id="S3V0141P0-01248"><member>THE CHANCELLOR OF THE EXCHEQUER</member><membercontribution> said, that the precedent occurred on the 8th of April, 1850, on a Motion for going into a Committee of Supply. An Amendment was moved by Captain Boldero on the subject of assistant-surgeons in the navy, when, on a division being called for, the Question was put that the words proposed to be left out stand part of the Question. The House divided, when the numbers were&#x2014;Ayes, 40; Noes, 48. The Question, "That the proposed words be added" was put and agreed to; the main Question, as amended, was put and agreed to; and the Question being then put, "That Mr. Speaker do now leave the chair," that Motion was agreed to, and the House went into Committee of Supply.</membercontribution></p>

Is there a way to read the text inside nested elements like this and return the element's ID?

lxml provides an xpath method, and XPath has a contains function, which you can use like this:

from lxml import etree as ET

doc = ET.fromstring('<p id="S3V0141P0-01248"><member>THE CHANCELLOR OF THE EXCHEQUER</member><membercontribution> said, that the precedent occurred on the 8th of April, 1850, on a Motion for going into a Committee of Supply. An Amendment was moved by Captain Boldero on the subject of assistant-surgeons in the navy, when, on a division being called for, the Question was put that the words proposed to be left out stand part of the Question. The House divided, when the numbers were&#x2014;Ayes, 40; Noes, 48. The Question, "That the proposed words be added" was put and agreed to; the main Question, as amended, was put and agreed to; and the Question being then put, "That Mr. Speaker do now leave the chair," that Motion was agreed to, and the House went into Committee of Supply.</membercontribution></p>')
# contains(., $word) tests the element's whole string value, nested text included
result = doc.xpath('//*[@id and contains(., $word)]', word = 'House divided')
print([elem.get('id') for elem in result])  # ['S3V0141P0-01248']
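
To tie this back to the loop in the question, here is a rough sketch that runs the same XPath once per search phrase and collects the ID together with the first two numbers found in the element's full text. The function name `find_divisions`, the sample `<debate>` document and its IDs are made up for illustration. Taking the first two numbers is carried over from the question; on real data it can pick up dates (such as "8th of April, 1850") instead of the vote counts, so treat that part as an assumption to refine.

```python
import re
from lxml import etree

SEARCH_WORDS = ('House divided', 'Committee divided', 'on Division',
                'Division List', 'The Ayes and the Noes')

def find_divisions(root, search_words=SEARCH_WORDS):
    """Return (id, first_number, second_number, phrase) tuples (hypothetical helper)."""
    rows = []
    for word in search_words:
        # contains(., $word) checks the element's own text plus all descendant text
        for elem in root.xpath('//*[@id and contains(., $word)]', word=word):
            text = ''.join(elem.itertext())   # flatten nested text
            votes = re.findall(r'\d+', text)
            if len(votes) > 1:
                rows.append((elem.get('id'), votes[0], votes[1], word))
    return rows

# Invented sample data, only to exercise the function
sample = etree.fromstring(
    '<debate>'
    '<p id="a1">The House divided; Against the Motion 83; For it 23.</p>'
    '<p id="a2"><member>A MEMBER</member><membercontribution> said the '
    'House divided: Ayes, 40; Noes, 48.</membercontribution></p>'
    '<p id="a3">No vote was taken.</p>'
    '</debate>')

for row in find_divisions(sample):
    print(row)
```

For the 500 files, you would call the function on `etree.parse(path).getroot()` inside the existing `for f in files:` loop. Note that if both an outer and an inner element carry an `id`, both will match.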

You can use some XPath to extract all the text nodes beneath each element you are interested in. I like Parsel (pip install parsel):

import parsel

data = ('<x><y><z><p id="S3V0141P0-01248"><member>THE CHANCELLOR OF THE EXCHEQUER'
        '</member><membercontribution> said, that the precedent occurred on the '
        '8th of April, 1850, on a Motion ...</membercontribution></p></z></y></x>')

selector = parsel.Selector(data)

for para in selector.xpath('//p'):
    id = para.xpath('@id').extract_first()
    texts = para.xpath('*/text()').extract()
    for text in texts:
        # do whatever search
        print(id, len(text), 'April' in text)

Output:

S3V0141P0-01248 31 False
S3V0141P0-01248 77 True
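
One caveat with `*/text()`: it only returns text nodes that sit directly inside the child elements, so text in the paragraph itself or in deeper levels is skipped. A small sketch of the difference, using lxml rather than Parsel and an invented `<p>` fragment (the XPath expressions themselves should work the same way in either library):

```python
from lxml import etree

# Invented fragment with text at three different depths
p = etree.fromstring(
    '<p id="x1">Lead-in <member>A MEMBER</member>'
    '<membercontribution> said <i>something</i> else.</membercontribution></p>')

child_texts = p.xpath('*/text()')    # text directly inside child elements only
all_texts = p.xpath('.//text()')     # every text node under <p>, at any depth
flat = p.xpath('string(.)')          # all of the text joined into one string

print(child_texts)
print(all_texts)
print(flat)
```

If a phrase might be split across nested tags, searching the joined string from `string(.)` is probably the safer option.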
