Download, decompress, and read a gzip file with Python

Question

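I would like to download, extract, and iterate over a text file in Python without creating temporary files.

Basically, this pipeline, but in Python:

curl ftp://ftp.theseed.org/genomes/SEED/SEED.fasta.gz | gunzip | processing step

Here is my code: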

def main():
    import urllib
    import gzip

    # Download SEED database
    print 'Downloading SEED Database'
    handle = urllib.urlopen('ftp://ftp.theseed.org/genomes/SEED/SEED.fasta.gz')


    with open('SEED.fasta.gz', 'wb') as out:
        while True:
            data = handle.read(1024)
            if len(data) == 0: break
            out.write(data)

    # Extract SEED database
    handle = gzip.open('SEED.fasta.gz')
    with open('SEED.fasta', 'w') as out:
        for line in handle:
            out.write(line)

    # Filter SEED database
    pass

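I don't want to use process.Popen() or anything like it, because I want this script to be platform independent.

The problem is that the gzip library only accepts file names as arguments, not handles. The reason for "piping" is that the download step only uses about 5% of the CPU, so it would be faster to run the extraction and processing at the same time.

EDIT: This won't work because

"Because of the way gzip compression works, GzipFile needs to save its position and move forwards and backwards through the compressed file. This doesn't work when the 'file' is a stream of bytes coming from a remote server; all you can do with it is retrieve bytes one at a time, not move back and forth through the data stream." - Dive Into Python

Which is why I get the error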

AttributeError: addinfourl instance has no attribute 'tell'

Answer 1

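Just use gzip.GzipFile(fileobj=handle) and you'll be on your way. In other words, it's not really true that "the gzip library only accepts file names as arguments and not handles"; you just have to use the fileobj= named argument.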
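
A minimal Python 3 sketch of this approach, assuming the handle is any binary file-like object (for example the response returned by urllib.request.urlopen). The helper name iter_gzip_lines and the UTF-8 default are assumptions made for illustration, not part of the original answer:

```python
import gzip
import io


def iter_gzip_lines(handle, encoding="utf-8"):
    """Yield decoded lines from a gzip-compressed binary file-like object."""
    # GzipFile wraps an already-open handle through the fileobj= argument.
    with gzip.GzipFile(fileobj=handle) as decompressed:
        # TextIOWrapper decodes bytes to str lazily, one line at a time.
        for line in io.TextIOWrapper(decompressed, encoding=encoding):
            yield line.rstrip("\n")


# Hypothetical usage against a remote file (not executed here):
#     import urllib.request
#     with urllib.request.urlopen(url) as response:
#         for line in iter_gzip_lines(response):
#             process(line)  # "process" stands in for the processing step
```

Because the lines come out of a generator, the processing step can start while the download is still in progress.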

Answer 2

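I found this question while searching for ways to download and decompress a gzip file from a URL, but I didn't manage to make the accepted answer work in Python 2.7.

Here is what worked for me (adapted from here):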

import urllib2
import gzip
import StringIO

def download(url):
    # Download SEED database
    out_file_path = url.split("/")[-1][:-3]
    print('Downloading SEED Database from: {}'.format(url))
    response = urllib2.urlopen(url)
    compressed_file = StringIO.StringIO(response.read())
    decompressed_file = gzip.GzipFile(fileobj=compressed_file)

    # Extract SEED database
    with open(out_file_path, 'wb') as outfile:
        outfile.write(decompressed_file.read())

    # Filter SEED database
    # ...
    return

if __name__ == "__main__":    
    download("ftp://ftp.ebi.ac.uk/pub/databases/Rfam/12.0/fasta_files/RF00001.fa.gz")

Since the original URL is dead, I changed the target URL: I simply looked for a gzip file served from an FTP server, as in the original question.
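
The code above reads the whole response into memory before decompressing it. If that is a concern, a chunked variant can be sketched in Python 3 with zlib.decompressobj, which only ever reads forward. The helper name iter_decompressed_chunks and the 64 KiB chunk size are assumptions made for illustration, and the sketch assumes a single-member gzip stream:

```python
import zlib


def iter_decompressed_chunks(handle, chunk_size=64 * 1024):
    """Yield decompressed byte chunks from a gzip stream, reading forward only."""
    # MAX_WBITS | 16 tells zlib to expect gzip framing (header and trailer).
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        data = decompressor.decompress(chunk)
        if data:
            yield data
    # Emit anything still buffered inside the decompressor.
    tail = decompressor.flush()
    if tail:
        yield tail
```

Here handle only needs a read(n) method, so no seek or tell calls are made on the remote stream.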

Answer 3

A Python 3 solution that does not require a for loop and writes the bytes object directly as a binary stream:

import gzip
import urllib.request


def download_file(url):
    out_file = '/path/to/file'

    # Download archive
    try:
        # Read the file inside the .gz archive located at url
        with urllib.request.urlopen(url) as response:
            with gzip.GzipFile(fileobj=response) as uncompressed:
                file_content = uncompressed.read()

        # Write to file in binary mode 'wb'
        with open(out_file, 'wb') as f:
            f.write(file_content)
            return 0

    except Exception as e:
        print(e)
        return 1

Call the function as retval = download_file(url) to capture the return code.

Answer 4

For Python 3.8, here is my code, written on 08/05/2020:

import re
from urllib import request
import gzip
import shutil

url1 = "https://www.destinationlighting.com/feed/sitemap_items1.xml.gz"
file_name1 = re.split(pattern='/', string=url1)[-1]
r1 = request.urlretrieve(url=url1, filename=file_name1)
txt1 = re.split(pattern=r'\.', string=file_name1)[0] + ".txt"

with gzip.open(file_name1, 'rb') as f_in:
    with open(txt1, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
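
If the intermediate .gz file on disk is not needed, the same shutil.copyfileobj call can be fed from a GzipFile wrapped around any binary file-like object, such as an open URL response. A minimal sketch under that assumption; the helper name gunzip_to_path is hypothetical and not part of the code above:

```python
import gzip
import shutil


def gunzip_to_path(src, dest_path):
    """Decompress the gzip stream `src` (binary file-like) into `dest_path`."""
    with gzip.GzipFile(fileobj=src) as f_in:
        with open(dest_path, "wb") as f_out:
            # copyfileobj copies in fixed-size chunks, so memory use stays bounded.
            shutil.copyfileobj(f_in, f_out)
    return dest_path
```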

4 answers

Answer 1: score 9, accepted, 2010-08-23 14:41:21
Answer 2: score 2, 2019-07-17 19:12:27
Answer 3: score 2, 2020-04-13 20:13:49
Answer 4: score 0, 2020-08-05 20:27:29