How can I get and store the text between XML tags as a string with the Python SAX parser?

I have an XML file that looks something like this:

<TAG1>
   <TAG2 attribute1 = "attribute_i_need" attribute2 = "attribute_i_dont_need" >
      Text I want to use
   </TAG2>
   <TAG3>
      Text I'm not interested in
   </TAG3>
   <TAG4>
      More text I want to use
   </TAG4>
</TAG1>

What I need is to get "Text I want to use" and "More text I want to use", but not "Text I'm not interested in", as strings that can later be passed to some arbitrary function. I also need to get "attribute_i_need" as a string. I haven't really used the SAX parser before and I'm completely stuck. I was able to print all of the text in the document using the following:

import xml.sax

class myHandler(xml.sax.ContentHandler):

    def characters(self, content):
        print (content)

parser = xml.sax.make_parser()
parser.setContentHandler(myHandler())
parser.parse(open("sample.xml", "r"))

This will basically give me the output:

Text I want to use
Text I'm not interested in
More text I want to use

But the problem is twofold. First, this includes text that I have no interest in. Second, all it does is print the text. I can't figure out how to print specific text only, or how to write code that returns the text as a string that I can assign to a variable and use later. And I don't even know where to start with extracting the attribute I'm interested in.

Does anyone know how to solve this problem? I would prefer a solution that uses the SAX parser, because I at least have a vague understanding of how it works.

The idea is to start saving all characters after encountering TAG2 or TAG4 and to stop whenever an element ends. An opening element is also an opportunity to inspect and save interesting attributes.

import xml.sax

class myHandler(xml.sax.ContentHandler):
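    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.text = []             # text chunks collected inside TAG2/TAG4
        self.keeping_text = False  # True while inside a wanted element
        self.attributes = []       # attribute1 values found on any element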

    def startElement(self, name, attrs):
        if name.lower() in ('tag2', 'tag4'):
            self.keeping_text = True

        try:
            # must attribute1 be on a tag2 or anywhere?
            attr = attrs.getValue('attribute1')
            self.attributes.append(attr)
        except KeyError:
            pass

    def endElement(self, name):
        # The end of any element stops text collection.
        self.keeping_text = False

    def characters(self, content):
        if self.keeping_text:
            self.text.append(content)

parser = xml.sax.make_parser()
handler = myHandler()
parser.setContentHandler(handler)
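# parse() accepts a filename directly, so no file handle is left open
parser.parse("sample.xml")

print(handler.text)
print(handler.attributes)

# Output reported in the original answer (Python 2; Python 3 prints the
# strings without the u prefix, and chunk boundaries depend on the parser):
# [u'\n', u'      Text I want to use', u'\n', u'   ',
#  u'\n', u'      More text I want to use', u'\n', u'   ']
# [u'attribute_i_need']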
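
The handler above returns a list of raw chunks, because a SAX parser may deliver one element's text over several characters() calls and includes the surrounding whitespace. To end up with one clean string per element, the chunks can be joined and stripped when the element ends. The following is only a minimal sketch of that idea: it assumes a well-formed copy of the sample embedded as a string, and the names TextCollector, WANTED, texts and attribute1_values are made up for illustration.

```python
import xml.sax

# Well-formed copy of the sample document (assumption: TAG1 is closed at the end).
SAMPLE = """<TAG1>
   <TAG2 attribute1 = "attribute_i_need" attribute2 = "attribute_i_dont_need" >
      Text I want to use
   </TAG2>
   <TAG3>
      Text I'm not interested in
   </TAG3>
   <TAG4>
      More text I want to use
   </TAG4>
</TAG1>"""


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler: one stripped string per wanted element."""

    WANTED = ("TAG2", "TAG4")

    def __init__(self):
        super().__init__()
        self.texts = []              # finished strings, one per wanted element
        self.attribute1_values = []  # attribute1 values found on TAG2
        self._chunks = None          # chunk buffer while inside a wanted element

    def startElement(self, name, attrs):
        if name in self.WANTED:
            self._chunks = []
        if name == "TAG2" and "attribute1" in attrs:
            self.attribute1_values.append(attrs["attribute1"])

    def characters(self, content):
        # Only buffer text while a wanted element is open.
        if self._chunks is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name in self.WANTED and self._chunks is not None:
            # Join the chunks and drop the surrounding whitespace.
            self.texts.append("".join(self._chunks).strip())
            self._chunks = None


handler = TextCollector()
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)

# Plain strings that can be assigned to variables and used later.
first_text, second_text = handler.texts
needed_attribute = handler.attribute1_values[0]
```

This sketch does not handle wanted elements nested inside each other; for the flat structure in the question that should not matter.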

I think BeautifulSoup or even bare lxml would be easier.
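
For comparison, here is a minimal sketch of the tree-based approach. Note that it uses the standard library's xml.etree.ElementTree rather than BeautifulSoup or lxml (lxml.etree offers a largely compatible API), and it assumes a well-formed copy of the sample embedded as a string. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the sample document (assumption: TAG1 is closed at the end).
SAMPLE = """<TAG1>
   <TAG2 attribute1 = "attribute_i_need" attribute2 = "attribute_i_dont_need" >
      Text I want to use
   </TAG2>
   <TAG3>
      Text I'm not interested in
   </TAG3>
   <TAG4>
      More text I want to use
   </TAG4>
</TAG1>"""

root = ET.fromstring(SAMPLE)  # root is the TAG1 element

tag2 = root.find("TAG2")
tag4 = root.find("TAG4")

# .text holds the text directly inside the element, whitespace included.
text_tag2 = tag2.text.strip()
text_tag4 = tag4.text.strip()

# .get() returns the attribute value as a string, or None if it is absent.
needed_attribute = tag2.get("attribute1")
```

Unlike SAX, this loads the whole document into memory, so the SAX handler can still be the better fit for very large files.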
