
Is there a fast XML parser in Python that allows me to get start of tag as byte offset in stream?

I am working with potentially huge XML files containing complex trace information from one of my projects.

I would like to build indexes for those XML files so that one can quickly find subsections of the XML document without having to load it all into memory.

If I have created a "shelve" index that contains information like "books for author Joe are at offsets [22322, 35446, 54545]", then I can open the XML file like a regular text file, seek to those offsets, and hand the data to one of the DOM parsers that take a file or a string.
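As a sketch of that lookup side (assuming the index maps a key such as `'author:Joe'` to a list of `(start, end)` byte ranges; all file names, keys, and offsets below are illustrative, not from the question):

```python
import os
import shelve
import tempfile
from xml.dom.minidom import parseString

def iter_fragments(index_path, xml_path, key):
    """Yield DOM trees for the byte ranges the shelve index lists under key."""
    with shelve.open(index_path) as idx, open(xml_path, 'rb') as f:
        for start, end in idx.get(key, []):
            f.seek(start)                      # jump straight to the subsection
            yield parseString(f.read(end - start))

# Tiny demonstration with a throwaway index and document
tmp = tempfile.mkdtemp()
xml_path = os.path.join(tmp, 'trace.xml')
with open(xml_path, 'wb') as f:
    f.write(b'<lib><book>A</book><book>B</book></lib>')
with shelve.open(os.path.join(tmp, 'idx')) as idx:
    idx['author:Joe'] = [(5, 19), (19, 33)]    # byte ranges of the two <book>s

for dom in iter_fragments(os.path.join(tmp, 'idx'), xml_path, 'author:Joe'):
    print(dom.documentElement.tagName)         # prints "book" twice
```

The part that still has to come from somewhere is the `(start, end)` ranges themselves, which is exactly what the indexing pass below is for.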

The part that I have not figured out yet is how to quickly parse the XML and create such an index.

So what I need is a fast SAX parser that allows me to find the start offset of tags in the file together with the start events. Then I can parse a subsection of the XML together with its starting point in the document, extract the key information, and store the key and offset in the shelve index.

Since locators return line and column numbers in lieu of offsets, you need a little wrapping to track line ends -- a simplified example (could have some off-by-ones ;-))...:

import io
import re
from xml import sax
from xml.sax import handler

# match newlines in the raw stream (byte offsets are what we want)
relinend = re.compile(rb'\n')

txt = b'''<foo>
            <tit>Bar</tit>
        <baz>whatever</baz>
     </foo>'''
stm = io.BytesIO(txt)

class LocatingWrapper(object):
    def __init__(self, f):
        self.f = f
        self.linelocs = []
        self.curoffs = 0

    def read(self, *a):
        data = self.f.read(*a)
        # record the absolute offset of every newline seen so far
        linends = (m.start() for m in relinend.finditer(data))
        self.linelocs.extend(x + self.curoffs for x in linends)
        self.curoffs += len(data)
        return data

    def where(self, loc):
        # line 1 starts at offset 0; line N (N > 1) starts one past the
        # (N-1)-th newline; columns reported by expat are 0-based
        line = loc.getLineNumber()
        col = loc.getColumnNumber()
        return col if line == 1 else self.linelocs[line - 2] + 1 + col

locstm = LocatingWrapper(stm)

class Handler(handler.ContentHandler):
    def setDocumentLocator(self, loc):
        self.loc = loc
    def startElement(self, name, attrs):
        print('%s@%s:%s (%s)' % (name,
                                 self.loc.getLineNumber(),
                                 self.loc.getColumnNumber(),
                                 locstm.where(self.loc)))

sax.parse(locstm, Handler())

Of course you don't need to keep all of the linelocs around -- to save memory, you can drop "old" ones (below the latest one queried), but then you need to make linelocs a dict, etc.
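One way that dict-based pruning might look (a hypothetical sketch; the class and method names are assumptions, not from the answer -- you would feed it newline offsets from the wrapper's `read` and query it from the handler):

```python
class PruningLinelocs:
    """Line-end offsets keyed by line number, pruned as queries advance."""

    def __init__(self):
        self.linelocs = {}   # line number -> byte offset of that line's '\n'
        self.nlines = 0

    def note_newline(self, offset):
        # called once per newline encountered while reading the stream
        self.nlines += 1
        self.linelocs[self.nlines] = offset

    def where(self, line, column):
        # line 1 starts at offset 0; line N starts one past newline N-1
        off = column if line == 1 else self.linelocs[line - 1] + 1 + column
        # SAX queries arrive in document order, so earlier lines are dead weight
        for old in [k for k in self.linelocs if k < line - 1]:
            del self.linelocs[old]
        return off
```

This trades the unbounded list for a window of at most a few entries, which matters precisely for the huge trace files the question describes.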
