
How to parse XML hierarchies with Python?

I'm new to Python and have been taking on various projects to get up to speed. At the moment, I'm working on a routine that will read through the Code of Federal Regulations and, for each paragraph, print the organizational hierarchy for that paragraph. For example, a simplified version of the CFR's XML schema would look like:

<CHAPTER>
    <HD SOURCE="HED">PART 229—NONDISCRIMINATION ON THE BASIS OF SEX IN EDUCATION PROGRAMS OR ACTIVITIES RECEIVING FEDERAL FINANCIAL ASSISTANCE</HD>
    <SECTION>
        <SECTNO>### 229.120</SECTNO>
        <SUBJECT>Transfers of property.</SUBJECT>
        <P>If a recipient sells or otherwise transfers property (…) subject to the provisions of ### 229.205 through 229.235(a).</P>
    </SECTION>
</CHAPTER>

I'd like to be able to print this to a CSV so that I can run text analysis:

Title 22, Volume 2, Part 229, Section 229.120, If a recipient sells or otherwise transfers property (…) subject to the provisions of ### 229.205 through 229.235(a).

Note that I'm not taking the Title and Volume numbers from the XML, because they are already included in the file name in a much more standardized format.
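A minimal sketch of pulling both numbers out of the file name with one regular expression, written for Python 3. The pattern assumes every file follows the `CFR-<year>-title<N>-vol<M>.xml` naming seen in the example path used in the question, and `title_and_volume` is a hypothetical helper name:

```python
import os
import re

# Assumed naming pattern: CFR-<year>-title<N>-vol<M>.xml
FILENAME_RE = re.compile(r"title(\d+)-vol(\d+)\.xml$")


def title_and_volume(file_path):
    """Return (title, volume) as strings, or None if the name does not match."""
    match = FILENAME_RE.search(os.path.basename(file_path))
    if match is None:
        return None
    return match.group(1), match.group(2)


print(title_and_volume(
    "/Users/owner1/Downloads/CFR-2012/title-22/CFR-2012-title22-vol1.xml"
))
```

Returning `None` for a non-matching name makes an unexpected file easy to detect, where slicing on `find()` offsets would silently return the wrong substring.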

Because I'm such a Python newbie, the code is mostly based on the search-engine code from Udacity's computer science course. Here's the Python I've written/adapted so far:

import os
import urllib2
from xml.dom.minidom import parseString
file_path = '/Users/owner1/Downloads/CFR-2012/title-22/CFR-2012-title22-vol1.xml'
file_name = os.path.basename(file_path) #Gets the filename from the path.
doc = open(file_path)
page = doc.read()

def clean_title(file_name): #Gets the title number from the filename.
    start_title = file_name.find('title')
    end_title = file_name.find("-", start_title+1)
    title = file_name[start_title+5:end_title]
    return title

def clean_volume(file_name): #Gets the volume number from the filename.
    start_volume = file_name.find('vol')
    end_volume = file_name.find('.xml', start_volume)
    volume = file_name[start_volume+3:end_volume]
    return volume

def get_next_section(page): #Gets all of the text between <SECTION> tags.
    start_section = page.find('<SECTION')
    if start_section == -1:
        return None, 0
    start_text = page.find('>', start_section)
    end_quote = page.find('</SECTION>', start_text + 1)
    section = page[start_text + 1:end_quote]
    return section, end_quote

def get_section_number(section): #Within the <SECTION> tag, find the section number based on the <SECTNO> tag.
    start_section_number = section.find('<SECTNO>###')
    if start_section_number == -1:
        return None, 0
    end_section_number = section.find('</SECTNO>', start_section_number)
    section_number = section[start_section_number+11:end_section_number]
    return section_number, end_section_number

def get_paragraph(section): #Within the <SECTION> tag, finds <P> paragraphs.
    start_paragraph = section.find('<P>')
    if start_paragraph == -1:
        return None, 0
    end_paragraph = section.find('</P>', start_paragraph)
    paragraph = section[start_paragraph+3:end_paragraph]
    return start_paragraph, paragraph, end_paragraph


def print_all_paragraphs(page): #This is the section that I would *like* to have print each paragraph and the citation hierarchy.
    section, endpos = get_next_section(page)
    for pragraph in section:
        title = clean_title(file_name)
        volume = clean_volume(file_name)
        section, endpos = get_next_section(page)
        section_number, end_section_number = get_section_number(section)
        start_paragraph, paragraph, end_paragraph = get_paragraph(section)
        if paragraph:
            print "Title: "+ title + " Volume: "+ volume +" Section Number: "+ section_number + " Text: "+ paragraph
            page = page[end_paragraph:]
        else:
            break

print print_all_paragraphs(page)
doc.close()

At the moment, this code has the following issues (example output to follow):

  1. It prints the first paragraph multiple times. How can I print each <P> tag with its own title number, volume number, etc.?

  2. The CFR has empty sections that are "Reserved". These sections don't have <P> tags, so the if statement breaks out of the loop. I've tried implementing for/while loops, but for some reason when I do this the code then just prints the first paragraph it finds repeatedly.
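One possible reading of both symptoms is that each search restarts from a position that does not move past the block just found. The sketch below, written for Python 3, keeps the same `str.find` style but carries the search position forward and simply yields nothing for a section without `<P>` tags. `iter_blocks` and `paragraphs_with_section` are hypothetical names, and the `<RESERVED>` tag in the sample is invented for illustration:

```python
def iter_blocks(text, tag):
    """Yield the inner text of each <tag ...>...</tag> block, in order."""
    open_tag = "<" + tag
    close_tag = "</" + tag + ">"
    pos = 0
    while True:
        begin = text.find(open_tag, pos)
        if begin == -1:
            return
        after_name = begin + len(open_tag)
        # Skip longer tag names that merely start with the same letters.
        if text[after_name:after_name + 1] not in (">", " "):
            pos = after_name
            continue
        gt = text.find(">", after_name)
        if gt == -1:
            return
        end = text.find(close_tag, gt + 1)
        if end == -1:
            return
        yield text[gt + 1:end]
        pos = end + len(close_tag)  # resume after this block


def paragraphs_with_section(page):
    """Yield (section number, paragraph text) for every <P> in every section."""
    for section in iter_blocks(page, "SECTION"):
        numbers = list(iter_blocks(section, "SECTNO"))
        number = numbers[0].strip() if numbers else ""
        # A section without <P> tags yields no rows and the loop moves on.
        for paragraph in iter_blocks(section, "P"):
            yield number, paragraph.strip()


sample = (
    "<SECTION><SECTNO>9.10</SECTNO><P>First.</P><P>Second.</P></SECTION>"
    "<SECTION><SECTNO>9.11</SECTNO><RESERVED>[Reserved]</RESERVED></SECTION>"
    "<SECTION><SECTNO>9.12</SECTNO><P>Third.</P></SECTION>"
)
for number, paragraph in paragraphs_with_section(sample):
    print(number, paragraph)
```

String searching like this does not handle nested tags of the same name, so a real XML parser is usually the safer choice for the full CFR files.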

Here's an example of the output:

Title: 22 Volume: 1 Section Number:  9.10 Text: All requests to the Department by a member of the public, a government employee, or an agency to declassify and release information shall result in a prompt declassification review of the information in accordance with procedures set forth in 22 CFR 171.20-25. Mandatory declassification review requests should be directed to the Information and Privacy Coordinator, U.S. Department of State, SA-2, 515 22nd St., NW., Washington, DC 20522-6001.
Title: 22 Volume: 1 Section Number:  9.10 Text: All requests to the Department by a member of the public, a government employee, or an agency to declassify and release information shall result in a prompt declassification review of the information in accordance with procedures set forth in 22 CFR 171.20-25. Mandatory declassification review requests should be directed to the Information and Privacy Coordinator, U.S. Department of State, SA-2, 515 22nd St., NW., Washington, DC 20522-6001.
Title: 22 Volume: 1 Section Number:  9.10 Text: All requests to the Department by a member of the public, a government employee, or an agency to declassify and release information shall result in a prompt declassification review of the information in accordance with procedures set forth in 22 CFR 171.20-25. Mandatory declassification review requests should be directed to the Information and Privacy Coordinator, U.S. Department of State, SA-2, 515 22nd St., NW., Washington, DC 20522-6001.
Title: 22 Volume: 1 Section Number:  9.11 Text: The Information and Privacy Coordinator shall be responsible for conducting a program for systematic declassification review of historically valuable records that were exempted from the automatic declassification provisions of section 3.3 of the Executive Order. The Information and Privacy Coordinator shall prioritize such review on the basis of researcher interest and the likelihood of declassification upon review.
Title: 22 Volume: 1 Section Number:  9.12 Text: For Department procedures regarding the access to classified information by historical researchers and certain former government personnel, see Sec. 171.24 of this Title.
Title: 22 Volume: 1 Section Number:  9.13 Text: Specific controls on the use, processing, storage, reproduction, and transmittal of classified information within the Department to provide protection for such information and to prevent access by unauthorized persons are contained in Volume 12 of the Department's Foreign Affairs Manual.
Title: 22 Volume: 1 Section Number:  9a.1 Text: These regulations implement Executive Order 11932 dated August 4, 1976 (41 FR 32691, August 5, 1976) entitled “Classification of Certain Information and Material Obtained from Advisory Bodies Created to Implement the International Energy Program.”
Title: 22 Volume: 1 Section Number:  9a.1 Text: These regulations implement Executive Order 11932 dated August 4, 1976 (41 FR 32691, August 5, 1976) entitled “Classification of Certain Information and Material Obtained from Advisory Bodies Created to Implement the International Energy Program.”
None

Ideally, each of the entries after the citation information would be different.

What kind of loop should I run to print this properly? Is there a more "pythonic" way of doing this kind of text extraction?

I understand that I am a complete novice, and one of the major problems I'm facing is that I simply don't have the vocabulary or topic knowledge to find detailed answers about parsing XML at this level of detail. Any recommended reading would also be welcome.

I like to solve problems like this with XPath or XSLT. You can find a great implementation in lxml (not part of the standard library, so it needs to be installed). For instance, the XPath expression //CHAPTER/SECTION[SECTNO] selects every SECTION element that has a SECTNO child. From each of those elements, you use relative XPath expressions to grab the values you want. The multiple nested for loops disappear. XPath has a bit of a learning curve, but there are many examples out there.
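As a rough illustration of the XPath idea, here is a sketch for Python 3 that uses the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset, in place of lxml (lxml elements offer the same `findall` and `findtext` calls, plus full XPath through `xpath()`). The sample document is a shortened, partly invented variant of the XML in the question, `section_rows` is a hypothetical helper, and the title and volume are passed in as plain strings on the assumption that they come from the file name:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened sample; the second (reserved) section is invented for illustration.
SAMPLE = """<CHAPTER>
<HD SOURCE="HED">PART 229 (heading text)</HD>
<SECTION>
<SECTNO>229.120</SECTNO>
<SUBJECT>Transfers of property.</SUBJECT>
<P>If a recipient sells property.</P>
<P>A second paragraph.</P>
</SECTION>
<SECTION>
<SECTNO>229.125</SECTNO>
<SUBJECT>[Reserved]</SUBJECT>
</SECTION>
</CHAPTER>"""


def section_rows(root, title, volume):
    """Yield one (title, volume, section number, text) row per <P> element."""
    # [SECTNO] keeps only SECTION elements that have a SECTNO child.
    for section in root.findall(".//SECTION[SECTNO]"):
        number = section.findtext("SECTNO", default="").strip()
        for paragraph in section.findall("P"):
            # itertext() also collects text inside nested inline tags.
            text = "".join(paragraph.itertext()).strip()
            yield title, volume, number, text


root = ET.fromstring(SAMPLE)
out = io.StringIO()  # stand-in for a file opened with newline=""
writer = csv.writer(out)
writer.writerow(["title", "volume", "section", "text"])
writer.writerows(section_rows(root, "22", "2"))
print(out.getvalue())
```

Because the inner loop runs once per `<P>` element, a section with no paragraphs contributes no rows, and `csv.writer` takes care of quoting the commas that appear in regulation text.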
