
Is there a Python module for transparently working with a file-like object as a buffer?

I'm working on a pure Python parser, where the input data may range in size from kilobytes to gigabytes. Is there a module that wraps a file-like object and abstracts explicit .open()/.seek()/.read()/.close() calls into a simple buffer-like object? You might think of this as the inverse of StringIO. I expect it might look something like:

with FileLikeObjectBackedBuffer(urllib.urlopen("http://www.google.com")) as buf:
    header = buf[0:0x10]
    footer = buf[-0x10:]

Note: I asked a similar question yesterday and accepted an answer that suggested mmapping a file. Here, I am specifically looking for a module that wraps a file-like object (for argument's sake, say something like what urllib returns).

Update: I've repeatedly come back to this question since I first asked it, and it turns out urllib may not have been the best example. It's a bit of a special case since it's a streaming interface. StringIO and bz2 expose a more traditional seek / read / close interface, and personally I use these more often. Therefore, I wrote a module that wraps file-like objects as buffers. You can check it out here.
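For readers who want to see the general shape of such a wrapper, here is a minimal Python 3 sketch. It is not the module mentioned above. `SeekableBuffer` is a hypothetical name, and the sketch assumes the wrapped object supports `seek`, `tell`, `read`, and `close` and returns bytes.

```python
import io


class SeekableBuffer:
    """Hypothetical wrapper: slice a seekable file-like object like a bytes buffer."""

    def __init__(self, fileobj):
        self._f = fileobj
        # Find the total size once by seeking to the end.
        fileobj.seek(0, io.SEEK_END)
        self._size = fileobj.tell()

    def __len__(self):
        return self._size

    def __getitem__(self, key):
        if isinstance(key, slice):
            # slice.indices() resolves negative and missing bounds against the size.
            start, stop, step = key.indices(self._size)
            if step != 1:
                raise ValueError("only contiguous slices are supported")
            self._f.seek(start)
            return self._f.read(max(0, stop - start))
        # Single index: normalise negatives, then read one byte.
        if key < 0:
            key += self._size
        if not 0 <= key < self._size:
            raise IndexError("index out of range")
        self._f.seek(key)
        return self._f.read(1)

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self._f.close()


# Usage mirroring the example in the question, with an in-memory file as a stand-in.
with SeekableBuffer(io.BytesIO(bytes(range(64)))) as buf:
    header = buf[0:0x10]
    footer = buf[-0x10:]
```

Each slice costs one seek and one read, so nothing is held in memory beyond the bytes actually requested.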

Although urllib.urlopen returns a file-like object, I don't believe it's possible to do what you want without writing your own wrapper. For instance, it doesn't support seek, though it does support next, read, etc. And since you're dealing with a forward-only stream, you'd have to handle jump-aheads by reading until you reach the requested point, and cache what you've read for any backtracking.

IMHO, you can't efficiently skip part of a network IO stream: if you want the last byte, you still have to fetch all the previous bytes to get there. How you manage that storage is up to you.
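The read-ahead-and-cache approach described above could look roughly like the following Python 3 sketch. `StreamCache` and its `chunk_size` parameter are hypothetical names, and the whole cache is kept in memory, which is only one of several ways to manage the storage.

```python
import io


class StreamCache:
    """Hypothetical buffer over a forward-only stream; caches every byte read so far."""

    def __init__(self, stream, chunk_size=8192):
        self._stream = stream
        self._chunk_size = chunk_size
        self._cache = bytearray()
        self._eof = False

    def _fill_to(self, n):
        # Read forward until at least n bytes are cached (n=None means read to the end).
        while not self._eof and (n is None or len(self._cache) < n):
            chunk = self._stream.read(self._chunk_size)
            if not chunk:
                self._eof = True
            else:
                self._cache += chunk

    def __getitem__(self, key):
        if not isinstance(key, slice) or key.step not in (None, 1):
            raise TypeError("only contiguous slices are supported in this sketch")
        start, stop = key.start, key.stop
        # Any bound relative to the end forces the whole stream to be consumed.
        needs_end = stop is None or stop < 0 or (start is not None and start < 0)
        self._fill_to(None if needs_end else stop)
        return bytes(self._cache[key])


# Usage with an in-memory stream standing in for a network response.
data = bytes(range(100))
cached = StreamCache(io.BytesIO(data), chunk_size=7)
header = cached[0:0x10]   # reads only as far as needed
footer = cached[-0x10:]   # has to consume the rest of the stream
```

Backtracking is then free because earlier bytes are served from the cache, but asking for the footer still pulls every byte of the stream, which is exactly the limitation described above.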

I would be tempted to urlretrieve (or similar) the file to disk and then mmap it, as per the answer to your previous question.
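As a rough Python 3 sketch of the mmap half of that suggestion: `head_and_tail` is a hypothetical helper, and it assumes the file has already been saved locally (for example with `urllib.request.urlretrieve(url, path)`) and is not empty, since mmap refuses to map an empty file.

```python
import mmap


def head_and_tail(path, n=16):
    """Return the first and last n bytes of a local, non-empty file via mmap."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Slicing a mmap returns bytes and accepts negative indices.
            return m[:n], m[-n:]
```

The operating system pages data in on demand, so slicing the two ends of a multi-gigabyte file does not read the middle into memory.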

If your server accepts range requests (and the response size is known, so that block offsets like those in your example can be derived from it), a possible workaround is byte serving: http://en.wikipedia.org/wiki/Byte_serving (though I can't say I've ever tried it).
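For what it's worth, a byte-serving request can be sketched in Python 3 as below. `make_range_request` and `fetch_range` are hypothetical helpers, and whether a given server honours the `Range` header is an assumption you would have to check; a server that ignores it simply sends the full body with status 200.

```python
from urllib.request import Request, urlopen


def make_range_request(url, spec):
    """Build a request for a byte range such as '0-15' or '-16' (the last 16 bytes)."""
    return Request(url, headers={"Range": "bytes=" + spec})


def fetch_range(url, spec):
    """Fetch one byte range; return None if the server ignored the Range header."""
    with urlopen(make_range_request(url, spec)) as resp:
        # 206 Partial Content means the range was honoured.
        if resp.status != 206:
            return None
        return resp.read()
```

With a cooperating server, `fetch_range(url, "0-15")` and `fetch_range(url, "-16")` would correspond to the header and footer slices in the question without transferring the bytes in between.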

Given the example, if you only want the first 16 and last 16 and don't want to do something "too fancy":

from string import ascii_lowercase
from random import choice
from StringIO import StringIO

buf = ''.join(choice(ascii_lowercase) for _ in range(50))
print buf

sio_buf = StringIO(buf) # make it a bit more like a stream object
first16 = sio_buf.read(16)
print first16

from collections import deque
last16 = deque(iter(lambda: sio_buf.read(1), ''), 16) # read(1) may look bad but it's buffered anyway - so...
print ''.join(last16)

Output:

gpsgvqsbixtwyakpgefrhntldsjqlmfvyzwjoykhsapcmvjmar
gpsgvqsbixtwyakp
wjoykhsapcmvjmar
