Is there a Python module for transparently working with a file's contents as a buffer?

I'm working on a pure Python file parser for event logs, which may range in size from kilobytes to gigabytes. Is there a module that abstracts explicit .open() / .seek() / .read() / .close() calls into a simple buffer-like object? You might think of this as the inverse of StringIO. I expect it might look something like:

with FileBackedBuffer('/my/favorite/path', 'rb') as buf:
    header = buf[0:0x10]
    footer = buf[0x10000000:]

The mmap module may fulfill my requirements; however, I have two reservations that I'd appreciate feedback on:

  1. It is important that the module handle files larger than available RAM/swap. I am unsure if mmap can do this well.
  2. The mmap constructors differ depending on the OS. This makes me hesitant, as I am looking to write nicely cross-platform code and would rather not muck about in OS specifics. I will if I need to, but this set off a warning that I might be looking in the wrong place.

If mmap is the correct module for such a task, how does it handle these two points? If it is not, what is an appropriate module?

mmap can easily handle files larger than RAM/swap. What mmap can't do is handle files larger than the address space, which means that 32-bit systems are limited in how large a file they can map.
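If the address-space limit is a concern, one possible workaround is to map only a window of the file rather than all of it. The sketch below is an illustration, not part of the original answer: `read_window` is a hypothetical helper name, and it assumes read-only access. It relies on the `offset` argument of `mmap.mmap`, which must be a multiple of `mmap.ALLOCATIONGRANULARITY`.

```python
import mmap
import os


def read_window(path, start, size):
    """Return up to `size` bytes starting at `start`, mapping only that window.

    Hypothetical helper: maps a small aligned region instead of the whole file.
    """
    gran = mmap.ALLOCATIONGRANULARITY
    with open(path, "rb") as f:
        file_size = os.fstat(f.fileno()).st_size
        if start >= file_size or size <= 0:
            return b""
        # Never ask for bytes past the end of the file.
        size = min(size, file_size - start)
        # The mapping offset must be aligned to the allocation granularity.
        aligned = (start // gran) * gran
        delta = start - aligned
        with mmap.mmap(f.fileno(), delta + size,
                       access=mmap.ACCESS_READ, offset=aligned) as window:
            return window[delta:delta + size]
```

A parser could call such a helper once per record or chunk, so that only a bounded amount of address space is in use at any moment.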

What happens with mmap is that the OS keeps only as much of the data in memory as it chooses to, but your program will think it is all there. Be careful with usage patterns, though: if your data doesn't fit in RAM and you jump around too randomly, it will thrash (discarding pages from your file that you haven't used recently to make room for the new pages to be loaded).
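As a rough illustration of a paging-friendly access pattern, the following sketch walks a mapping strictly forward, looking for a record marker. The function name `find_record_offsets` and the `marker` argument are assumptions made for the example; a real event-log format would define its own record signature.

```python
import mmap


def find_record_offsets(path, marker):
    """Return the offsets of every non-overlapping occurrence of `marker`.

    The scan only moves forward, so pages are touched in file order.
    Note that mmap raises ValueError for an empty file.
    """
    offsets = []
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
            pos = buf.find(marker)
            while pos != -1:
                offsets.append(pos)
                # Resume the search just past the marker that was found.
                pos = buf.find(marker, pos + len(marker))
    return offsets
```

Because the search never revisits earlier offsets, the OS can drop pages behind the scan without having to reload them.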

If you don't need to specify anything beyond fileno and length, I don't believe you need to worry about the platform-specific arguments for mmap. If you do need to worry about the extra arguments, then you will either have to master Windows versus Unix, or pass that on to your users. I don't know what your library will be, but it may be nice to provide reasonable defaults on both platforms while also allowing the user to tweak the options. It looks to me like you are unlikely to care about the Windows tagname option. Also, if you are cross-platform, just accept the Unix default for prot, since you have no choice on Windows. That only leaves caring about MAP_PRIVATE and MAP_SHARED. The default is MAP_SHARED; I'm not sure whether that is the option that most closely matches Windows behavior, but accepting the default is probably fine there.
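To make this concrete, here is a minimal sketch of the FileBackedBuffer object from the question, built on mmap using only arguments that are accepted on both Windows and Unix: the file descriptor, a length of 0 (meaning the whole file), and the `access` keyword. The class is hypothetical and read-only, and it takes only a path rather than the mode argument shown in the question.

```python
import mmap


class FileBackedBuffer:
    """Hypothetical read-only, sliceable view of a file, backed by mmap."""

    def __init__(self, path):
        self._path = path
        self._file = None
        self._map = None

    def __enter__(self):
        self._file = open(self._path, "rb")
        try:
            # fileno, length 0 and access= work the same way on both platforms.
            self._map = mmap.mmap(self._file.fileno(), 0,
                                  access=mmap.ACCESS_READ)
        except Exception:
            self._file.close()
            raise
        return self

    def __exit__(self, exc_type, exc, tb):
        self._map.close()
        self._file.close()
        return False

    def __len__(self):
        return len(self._map)

    def __getitem__(self, key):
        # A slice returns bytes; a single integer index returns an int.
        return self._map[key]
```

With this in place, `with FileBackedBuffer(path) as buf: header = buf[0:0x10]` behaves like the snippet in the question. Write support would need a different access mode and is left out of this sketch.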
