
Is there a Python module for regex matching in zip files?

I have over a million text files compressed into 40 zip files. I also have a list of about 500 phone model names. I want to find out how many times each model is mentioned in the text files.

Is there any Python module that can run a regex match on the files without unzipping them? Is there a simple way to solve this problem without unzipping?

There's nothing that will automatically do what you want.

However, Python's zipfile module makes this easy to do. Here's how to iterate over the lines of every file in an archive.

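#!/usr/bin/env python3

import zipfile

# Members are decompressed one at a time, in memory; nothing is extracted to disk.
with zipfile.ZipFile('myfile.zip') as f:
    for subfile in f.namelist():
        print(subfile)
        # read() returns bytes; decode to text (UTF-8 is assumed here)
        data = f.read(subfile).decode('utf-8', errors='replace')
        for line in data.splitlines():
            print(line)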

You could loop through the zip files, reading individual files using the zipfile module and running your regex on those, eliminating the need to unzip all the files at once.
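A minimal sketch of that loop, assuming the text files are UTF-8 and that a mention means a case-insensitive, whole-word occurrence of the model name. The function name `count_mentions` and its parameters are made up for illustration; only the `zipfile`, `re`, and `collections` calls come from the standard library.

```python
import re
import zipfile
from collections import Counter


def count_mentions(zip_paths, models, encoding="utf-8"):
    """Count case-insensitive mentions of each model name across zip archives."""
    # Longest names first, so "Galaxy S9 Plus" is tried before "Galaxy S9".
    ordered = sorted(models, key=len, reverse=True)
    # One capturing group per model; match.lastindex tells us which one matched.
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join("(" + re.escape(m) + ")" for m in ordered) + r")(?!\w)",
        re.IGNORECASE,
    )
    counts = Counter()
    for path in zip_paths:
        with zipfile.ZipFile(path) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                # Decompress a single member into memory (assumed to be text).
                text = archive.read(info).decode(encoding, errors="replace")
                for match in pattern.finditer(text):
                    counts[ordered[match.lastindex - 1]] += 1
    return counts
```

Compiling all 500 names into one alternation means each file is scanned once rather than 500 times. Whether whole-word matching is the right rule depends on how the model names appear in your data.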

I'm fairly certain that you can't run a regex over the zipped data, at least not meaningfully.

To access the contents of a zip file you have to unzip it, although the zipfile module makes this fairly easy, as you can unzip each file within an archive individually.
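As a rough illustration of unzipping members individually, the sketch below streams each member line by line with `ZipFile.open`, so a large file never has to be held in memory whole. The helper name `iter_matching_lines` is hypothetical, and UTF-8 text is assumed.

```python
import io
import re
import zipfile


def iter_matching_lines(zip_path, pattern, encoding="utf-8"):
    """Yield (member_name, line_number, line) for every line matching pattern."""
    regex = re.compile(pattern)
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # open() returns a binary file-like object that decompresses as it is read.
            with archive.open(info) as raw:
                text = io.TextIOWrapper(raw, encoding=encoding, errors="replace")
                for number, line in enumerate(text, start=1):
                    if regex.search(line):
                        yield info.filename, number, line.rstrip("\n")
```

For example, `list(iter_matching_lines('myfile.zip', r'Nokia \d+'))` would collect every matching line together with the member it came from.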

Python zipfile module

Isn't it (at least theoretically) possible to read in the ZIP's Huffman coding and then translate the regexp into the Huffman code? Might this be more efficient than first decompressing the data and then running the regexp?

(Note: I know it wouldn't be quite that simple. You'd also have to deal with other aspects of the ZIP coding, such as file layout, block structures, and back-references, but one imagines this could be fairly lightweight.)

EDIT: Also note that it's probably much more sensible to just use the zipfile solution.
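To see why a plain regex over the raw compressed bytes does not work, here is a small experiment using `zlib`, which implements DEFLATE, the compression method zip archives most commonly use (that the archives in question use DEFLATE is an assumption). The sample text is invented for the demonstration.

```python
import re
import zlib

# 20 repetitions of a sentence mentioning one model name.
text = ("The Nokia 3310 is a classic phone. " * 20).encode("utf-8")
compressed = zlib.compress(text)
pattern = re.compile(rb"Nokia 3310")

# Literals are Huffman-coded at the bit level and repeats become back-references,
# so the pattern's bytes are not expected to appear in the compressed stream.
hits_compressed = len(pattern.findall(compressed))
hits_plain = len(pattern.findall(zlib.decompress(compressed)))
print(hits_compressed, hits_plain)
```

Searching the compressed stream directly would need a matcher that understands the bit-level codes and resolves back-references, which amounts to doing most of the decompression anyway.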


 