
Download, extract and read a gzip file in Python

I'd like to download, extract and iterate over a text file in Python without having to create temporary files.

Basically, this pipe, but in Python:

curl ftp://ftp.theseed.org/genomes/SEED/SEED.fasta.gz | gunzip | processing step

Here's my code:

def main():
    import urllib
    import gzip

    # Download SEED database
    print 'Downloading SEED Database'
    handle = urllib.urlopen('ftp://ftp.theseed.org/genomes/SEED/SEED.fasta.gz')


    with open('SEED.fasta.gz', 'wb') as out:
        while True:
            data = handle.read(1024)
            if len(data) == 0: break
            out.write(data)

    # Extract SEED database
    handle = gzip.open('SEED.fasta.gz')
    with open('SEED.fasta', 'w') as out:
        for line in handle:
            out.write(line)

    # Filter SEED database
    pass

I don't want to use subprocess.Popen() or anything like it, because I want this script to be platform-independent.

The problem is that the Gzip library only accepts filenames as arguments and not handles. The reason for "piping" is that the download step only uses up ~5% CPU and it would be faster to run the extraction and processing at the same time.


EDIT: This won't work because

"Because of the way gzip compression works, GzipFile needs to save its position and move forwards and backwards through the compressed file. This doesn't work when the “file” is a stream of bytes coming from a remote server; all you can do with it is retrieve bytes one at a time, not move back and forth through the data stream." “由于 gzip 压缩的工作方式,GzipFile 需要保存它的位置并在压缩文件中前后移动。当“文件”是来自远程服务器的字节流时,这不起作用;所有你能做的它一次检索一个字节,而不是在数据流中来回移动。” - dive into python - 深入蟒蛇

Which is why I get the error

AttributeError: addinfourl instance has no attribute 'tell'

So how does curl url | gunzip | whatever work?

Just gzip.GzipFile(fileobj=handle) and you're set. In other words, "the Gzip library only accepts filenames as arguments and not handles" is not actually true; you just need to use the fileobj= keyword argument.
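
For example, here is a minimal sketch of that fileobj= approach, assuming Python 3; the URL is the one from the question and may no longer resolve, and process_line() is a hypothetical placeholder for whatever processing step you need:

import gzip
import urllib.request

def process_line(line):
    # hypothetical placeholder for the real processing step
    print(line[:60])

url = 'ftp://ftp.theseed.org/genomes/SEED/SEED.fasta.gz'  # from the question; may be dead

# stream the download and decompress on the fly, with no temporary files
with urllib.request.urlopen(url) as response:
    with gzip.GzipFile(fileobj=response) as uncompressed:
        for line in uncompressed:  # GzipFile yields bytes, one line at a time
            process_line(line.decode('ascii'))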

I found this question while searching for methods to download and unzip a gzip file from a URL, but I didn't manage to make the accepted answer work in Python 2.7.

Here's what worked for me (adapted from here):

import urllib2
import gzip
import StringIO

def download(url):
    # Download SEED database
    out_file_path = url.split("/")[-1][:-3]
    print('Downloading SEED Database from: {}'.format(url))
    response = urllib2.urlopen(url)
    compressed_file = StringIO.StringIO(response.read())
    decompressed_file = gzip.GzipFile(fileobj=compressed_file)

    # Extract SEED database
    with open(out_file_path, 'w') as outfile:
        outfile.write(decompressed_file.read())

    # Filter SEED database
    # ...
    return

if __name__ == "__main__":    
    download("ftp://ftp.ebi.ac.uk/pub/databases/Rfam/12.0/fasta_files/RF00001.fa.gz")

I changed the target URL since the original one was dead: I just looked for a gzip file served from an ftp server, as in the original question.
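
If you would rather iterate over the decompressed lines (as the original question asks) instead of writing everything to disk, a small Python 2.7 variation on the code above might look like this; the processing step is left as a placeholder:

import urllib2
import gzip
import StringIO

response = urllib2.urlopen("ftp://ftp.ebi.ac.uk/pub/databases/Rfam/12.0/fasta_files/RF00001.fa.gz")
compressed_file = StringIO.StringIO(response.read())

# StringIO gives GzipFile the seekable object it expects here
with gzip.GzipFile(fileobj=compressed_file) as decompressed_file:
    for line in decompressed_file:
        pass  # placeholder for the filtering / processing step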

A Python 3 solution which does not require a for loop and writes the bytes object directly as a binary stream:

import gzip
import urllib.request

def download_file(url):
    out_file = '/path/to/file'

    # Download archive
    try:
        # Read the file inside the .gz archive located at url
        with urllib.request.urlopen(url) as response:
            with gzip.GzipFile(fileobj=response) as uncompressed:
                file_content = uncompressed.read()

        # write to file in binary mode 'wb'
        with open(out_file, 'wb') as f:
            f.write(file_content)
            return 0

    except Exception as e:
        print(e)
        return 1

Call the function with retval = download_file(url) to capture the return code.
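
For instance, a usage sketch under the same assumptions (the Rfam URL from the earlier answer is reused here; remember that out_file inside the function is still a placeholder path):

retval = download_file("ftp://ftp.ebi.ac.uk/pub/databases/Rfam/12.0/fasta_files/RF00001.fa.gz")
if retval == 0:
    print("Download and extraction succeeded")
else:
    print("Something went wrong; see the printed exception")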

For Python 3.8, here is my code, written on 08/05/2020:

import re
from urllib import request
import gzip
import shutil

# gzipped file to fetch; the local file name is taken from the last URL segment
url1 = "https://www.destinationlighting.com/feed/sitemap_items1.xml.gz"
file_name1 = re.split(pattern='/', string=url1)[-1]

# download the archive to disk, then build an output name ending in .txt
r1 = request.urlretrieve(url=url1, filename=file_name1)
txt1 = re.split(pattern=r'\.', string=file_name1)[0] + ".txt"

# decompress the downloaded archive into the output file
with gzip.open(file_name1, 'rb') as f_in:
    with open(txt1, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
