
How to parse XML hierarchies with Python?

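I'm new to Python and have been taking on various projects to get up to speed. At the moment, I'm working on a routine that reads through the Code of Federal Regulations (CFR) and, for each paragraph, prints that paragraph's organizational hierarchy. For example, a simplified version of the CFR's XML scheme looks like this: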

<CHAPTER>
<HD SOURCE="HED">PART 229—NONDISCRIMINATION ON THE BASIS OF SEX IN EDUCATION PROGRAMS OR ACTIVITIES RECEIVING FEDERAL FINANCIAL ASSISTANCE</HD>
     <SECTION>
        <SECTNO>### 229.120</SECTNO>
        <SUBJECT>Transfers of property.</SUBJECT>
        <P>If a recipient sells or otherwise transfers property (…) subject to the provisions of ### 229.205 through 229.235(a).</P>
     </SECTION>

I'd like to be able to print this to a CSV so that I can run text analysis:

Title 22, Volume 2, Part 229, Section 229.120, If a recipient sells or otherwise transfers property (...) subject to the provisions of ### 229.205 through 229.235(a).
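Paragraph text in the CFR usually contains commas, so joining fields with "," by hand would produce broken rows. A minimal sketch of writing such rows with Python's standard csv module follows; the function name rows_to_csv, the column names, and the example paragraph text are illustrative assumptions, not part of the original code.

```python
import csv
import io

def rows_to_csv(rows):
    """Serialize (title, volume, part, section, text) tuples as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Header names are hypothetical; pick whatever the analysis tool expects.
    writer.writerow(["title", "volume", "part", "section", "text"])
    writer.writerows(rows)
    return buffer.getvalue()

# Placeholder row: the paragraph text is invented to show comma handling.
example_rows = [
    ("22", "2", "229", "229.120", "Example paragraph text, with a comma."),
]
csv_text = rows_to_csv(example_rows)
print(csv_text)
```

The writer quotes any field that contains a comma, so the paragraph stays in a single column when the file is read back.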

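Note that I'm not taking the title and volume numbers from the XML, because they are already included in the file name in a much more standardized format.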
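Because the file name follows a fixed pattern, one regular expression can pull out both numbers at once. This is a rough sketch that assumes every file is named like CFR-2012-title22-vol1.xml; the names FILENAME_PATTERN and title_and_volume are hypothetical.

```python
import os
import re

# Assumes names of the form "...title<digits>-vol<digits>.xml".
FILENAME_PATTERN = re.compile(r"title(\d+)-vol(\d+)\.xml$")

def title_and_volume(file_path):
    """Return (title, volume) as strings, or None if the name does not match."""
    match = FILENAME_PATTERN.search(os.path.basename(file_path))
    if match is None:
        return None
    return match.group(1), match.group(2)

path = "/Users/owner1/Downloads/CFR-2012/title-22/CFR-2012-title22-vol1.xml"
print(title_and_volume(path))  # ('22', '1')
```

Taking the basename first keeps the "title-22" directory in the path from interfering with the match.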

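Because I'm such a Python newbie, the code is mostly based on the search engine code from Udacity's computer science course. Here is the Python I have written/adapted so far: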

import os
import urllib2
from xml.dom.minidom import parseString
file_path = '/Users/owner1/Downloads/CFR-2012/title-22/CFR-2012-title22-vol1.xml'
file_name = os.path.basename(file_path) #Gets the filename from the path.
doc = open(file_path)
page = doc.read()

def clean_title(file_name): #Gets the title number from the filename.
    start_title = file_name.find('title')
    end_title = file_name.find("-", start_title+1)
    title = file_name[start_title+5:end_title]
    return title

def clean_volume(file_name): #Gets the volume number from the filename.
    start_volume = file_name.find('vol')
    end_volume = file_name.find('.xml', start_volume)
    volume = file_name[start_volume+3:end_volume]
    return volume

def get_next_section(page): #Gets all of the text between <SECTION> tags.
    start_section = page.find('<SECTION')
    if start_section == -1:
        return None, 0
    start_text = page.find('>', start_section)
    end_quote = page.find('</SECTION>', start_text + 1)
    section = page[start_text + 1:end_quote]
    return section, end_quote

def get_section_number(section): #Within the <SECTION> tag, find the section number based on the <SECTNO> tag.
    start_section_number = section.find('<SECTNO>###')
    if start_section_number == -1:
        return None, 0
    end_section_number = section.find('</SECTNO>', start_section_number)
    section_number = section[start_section_number+11:end_section_number]
    return section_number, end_section_number

def get_paragraph(section): #Within the <SECTION> tag, finds <P> paragraphs.
    start_paragraph = section.find('<P>')
    if start_paragraph == -1:
        return None, 0
    end_paragraph = section.find('</P>', start_paragraph)
    paragraph = section[start_paragraph+3:end_paragraph]
    return start_paragraph, paragraph, end_paragraph


def print_all_paragraphs(page): #This is the section that I would *like* to have print each paragraph and the citation hierarchy.
    section, endpos = get_next_section(page)
    for pragraph in section:
        title = clean_title(file_name)
        volume = clean_volume(file_name)
        section, endpos = get_next_section(page)
        section_number, end_section_number = get_section_number(section)
        start_paragraph, paragraph, end_paragraph = get_paragraph(section)
        if paragraph:
            print "Title: "+ title + " Volume: "+ volume +" Section Number: "+ section_number + " Text: "+ paragraph
            page = page[end_paragraph:]
        else:
            break

print print_all_paragraphs(page)
doc.close()

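At the moment, this code has the following problems (example output follows):

  1. It prints the first paragraph multiple times. How can I print each <P> tag with its own title number, volume number, etc.?

  2. The CFR has empty sections that are "Reserved". These sections don't have <P> tags, so the if/else check hits break and the loop stops. I've tried implementing for/while loops, but for some reason when I do, the code just prints the first paragraph it finds, over and over.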
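Both symptoms point at the search position. print_all_paragraphs calls get_next_section(page) from the start of page on every pass, and page[end_paragraph:] slices with an offset measured inside the section rather than inside the page, so the same section keeps being found. A minimal sketch of a cursor-based alternative in Python 3 follows; the generator names and the sample string are assumptions for illustration only.

```python
def iter_sections(page):
    """Yield the inner text of each <SECTION>...</SECTION> block, in order."""
    cursor = 0
    while True:
        start = page.find("<SECTION", cursor)
        if start == -1:
            return
        tag_end = page.find(">", start)
        if tag_end == -1:
            return
        end = page.find("</SECTION>", tag_end)
        if end == -1:
            return
        yield page[tag_end + 1:end]
        cursor = end + len("</SECTION>")  # resume after this section

def iter_paragraphs(section):
    """Yield the text of every <P>...</P> in one section (possibly none)."""
    cursor = 0
    while True:
        start = section.find("<P>", cursor)
        if start == -1:
            return
        end = section.find("</P>", start)
        if end == -1:
            return
        yield section[start + len("<P>"):end]
        cursor = end + len("</P>")  # resume after this paragraph

# Hypothetical sample: the middle section is "Reserved" and has no <P> tag.
sample = (
    "<SECTION><SECTNO>1.1</SECTNO><P>First.</P><P>Second.</P></SECTION>"
    "<SECTION><SECTNO>1.2</SECTNO><SUBJECT>[Reserved]</SUBJECT></SECTION>"
    "<SECTION><SECTNO>1.3</SECTNO><P>Third.</P></SECTION>"
)
paragraphs = [p for s in iter_sections(sample) for p in iter_paragraphs(s)]
print(paragraphs)
```

A section without <P> tags simply yields nothing, so the outer loop moves on to the next section instead of stopping.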

Here is an example of the output:

Title: 22 Volume: 1 Section Number:  9.10 Text: All requests to the Department by a member of the public, a government employee, or an agency to declassify and release information shall result in a prompt declassification review of the information in accordance with procedures set forth in 22 CFR 171.20-25. Mandatory declassification review requests should be directed to the Information and Privacy Coordinator, U.S. Department of State, SA-2, 515 22nd St., NW., Washington, DC 20522-6001.
Title: 22 Volume: 1 Section Number:  9.10 Text: All requests to the Department by a member of the public, a government employee, or an agency to declassify and release information shall result in a prompt declassification review of the information in accordance with procedures set forth in 22 CFR 171.20-25. Mandatory declassification review requests should be directed to the Information and Privacy Coordinator, U.S. Department of State, SA-2, 515 22nd St., NW., Washington, DC 20522-6001.
Title: 22 Volume: 1 Section Number:  9.10 Text: All requests to the Department by a member of the public, a government employee, or an agency to declassify and release information shall result in a prompt declassification review of the information in accordance with procedures set forth in 22 CFR 171.20-25. Mandatory declassification review requests should be directed to the Information and Privacy Coordinator, U.S. Department of State, SA-2, 515 22nd St., NW., Washington, DC 20522-6001.
Title: 22 Volume: 1 Section Number:  9.11 Text: The Information and Privacy Coordinator shall be responsible for conducting a program for systematic declassification review of historically valuable records that were exempted from the automatic declassification provisions of section 3.3 of the Executive Order. The Information and Privacy Coordinator shall prioritize such review on the basis of researcher interest and the likelihood of declassification upon review.
Title: 22 Volume: 1 Section Number:  9.12 Text: For Department procedures regarding the access to classified information by historical researchers and certain former government personnel, see Sec. 171.24 of this Title.
Title: 22 Volume: 1 Section Number:  9.13 Text: Specific controls on the use, processing, storage, reproduction, and transmittal of classified information within the Department to provide protection for such information and to prevent access by unauthorized persons are contained in Volume 12 of the Department's Foreign Affairs Manual.
Title: 22 Volume: 1 Section Number:  9a.1 Text: These regulations implement Executive Order 11932 dated August 4, 1976 (41 FR 32691, August 5, 1976) entitled “Classification of Certain Information and Material Obtained from Advisory Bodies Created to Implement the International Energy Program.”
Title: 22 Volume: 1 Section Number:  9a.1 Text: These regulations implement Executive Order 11932 dated August 4, 1976 (41 FR 32691, August 5, 1976) entitled “Classification of Certain Information and Material Obtained from Advisory Bodies Created to Implement the International Energy Program.”
None

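Ideally, each entry after the citation information would be different.

What type of loop should I run to print this properly? Is there a more "pythonic" way of doing this kind of text extraction?

I understand that I am a complete novice, and one of the main problems I face is that I simply don't have the vocabulary or topic knowledge to find detailed answers about XML parsing at this level. Any recommended reading would also be welcome.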

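I like to solve problems like this with XPath or XSLT. You can find a great implementation in lxml (not in the standard distribution, so it needs to be installed). For instance, the XPath //CHAPTER/SECTION[SECTNO] selects all sections that have a section number. From there, you use relative XPath statements to grab the values you need. Multiple nested for loops disappear. XPath has a bit of a learning curve, but there are many examples out there.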
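As a rough illustration of that approach, the sketch below uses the standard library's xml.etree.ElementTree, which supports a limited subset of XPath. lxml's etree module offers a similar API plus full XPath 1.0 through its xpath() method. The sample document, its second section number, and the function name iter_paragraph_rows are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample modelled on the simplified CFR snippet.
SAMPLE = """
<CHAPTER>
  <HD SOURCE="HED">PART 229 (heading shortened for this sketch)</HD>
  <SECTION>
    <SECTNO>229.120</SECTNO>
    <SUBJECT>Transfers of property.</SUBJECT>
    <P>First paragraph.</P>
    <P>Second paragraph.</P>
  </SECTION>
  <SECTION>
    <SECTNO>229.125</SECTNO>
    <SUBJECT>[Reserved]</SUBJECT>
  </SECTION>
</CHAPTER>
"""

def iter_paragraph_rows(xml_text):
    """Yield (part heading, section number, paragraph text) for every <P>."""
    root = ET.fromstring(xml_text)
    # iter() also matches the root element itself, so this works whether
    # CHAPTER is the document element or nested deeper in the tree.
    for chapter in root.iter("CHAPTER"):
        heading = chapter.findtext("HD", default="").strip()
        # "SECTION[SECTNO]" keeps only sections that have a SECTNO child.
        for section in chapter.findall("SECTION[SECTNO]"):
            number = section.findtext("SECTNO").strip()
            for paragraph in section.findall("P"):
                # itertext() also collects text inside nested inline tags.
                text = "".join(paragraph.itertext()).strip()
                yield heading, number, text

rows = list(iter_paragraph_rows(SAMPLE))
for row in rows:
    print(row)
```

Each paragraph carries its own heading and section number, and a "Reserved" section without <P> children contributes no rows instead of stopping the loop.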

