Download, extract and read a gzip file in Python
I'd like to download, extract and iterate over a text file in Python without having to create temporary files.
Basically, this pipe, but in Python:
curl ftp://ftp.theseed.org/genomes/SEED/SEED.fasta.gz | gunzip | processing step
Here's my code:
def main():
    import urllib
    import gzip
    # Download SEED database
    print 'Downloading SEED Database'
    handle = urllib.urlopen('ftp://ftp.theseed.org/genomes/SEED/SEED.fasta.gz')
    with open('SEED.fasta.gz', 'wb') as out:
        while True:
            data = handle.read(1024)
            if len(data) == 0: break
            out.write(data)
    # Extract SEED database
    handle = gzip.open('SEED.fasta.gz')
    with open('SEED.fasta', 'w') as out:
        for line in handle:
            out.write(line)
    # Filter SEED database
    pass
I don't want to use subprocess.Popen() or anything like that because I want this script to be platform-independent.
The problem is that the Gzip library only accepts filenames as arguments and not handles. The reason for "piping" is that the download step only uses up ~5% CPU and it would be faster to run the extraction and processing at the same time.
EDIT: This won't work because
"Because of the way gzip compression works, GzipFile needs to save its position and move forwards and backwards through the compressed file. This doesn't work when the 'file' is a stream of bytes coming from a remote server; all you can do with it is retrieve bytes one at a time, not move back and forth through the data stream." - Dive Into Python
Which is why I get the error
AttributeError: addinfourl instance has no attribute 'tell'
So how does
curl url | gunzip | whatever
work?
Just use gzip.GzipFile(fileobj=handle) and you're done. In other words, "the Gzip library only accepts filenames as arguments and not handles" is not true; you just need to use the fileobj= keyword argument.
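For illustration, here is a minimal self-contained sketch of the fileobj= approach. An in-memory io.BytesIO object stands in for the network response so the snippet runs offline; in Python 3, the file-like handle returned by urllib.request.urlopen can be passed the same way.

```python
import gzip
import io

# Build a gzip "download" in memory to stand in for the FTP response.
raw = b">seq1\nACGT\n>seq2\nGGCC\n"
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(raw)
buf.seek(0)

# Wrap the file-like object with fileobj= -- no temporary file, and
# each line can be processed as it is decompressed.
with gzip.GzipFile(fileobj=buf) as handle:
    for line in handle:
        print(line)
```

Because GzipFile decompresses lazily as you iterate, the processing step can overlap with the download, which is what the shell pipeline achieves.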
I've found this question while searching for methods to download and unzip a gzip file from a URL, but I didn't manage to make the accepted answer work in Python 2.7. Here's what worked for me (adapted from here):
import urllib2
import gzip
import StringIO

def download(url):
    # Download SEED database
    out_file_path = url.split("/")[-1][:-3]
    print('Downloading SEED Database from: {}'.format(url))
    response = urllib2.urlopen(url)
    compressed_file = StringIO.StringIO(response.read())
    decompressed_file = gzip.GzipFile(fileobj=compressed_file)
    # Extract SEED database
    with open(out_file_path, 'w') as outfile:
        outfile.write(decompressed_file.read())
    # Filter SEED database
    # ...
    return

if __name__ == "__main__":
    download("ftp://ftp.ebi.ac.uk/pub/databases/Rfam/12.0/fasta_files/RF00001.fa.gz")
I changed the target URL since the original one was dead: I just looked for a gzip file served from an FTP server, as in the original question.
A Python 3 solution which does not require a for loop and writes the byte object directly as a binary stream:
import gzip
import urllib.request

def download_file(url):
    out_file = '/path/to/file'
    # Download archive
    try:
        # Read the file inside the .gz archive located at url
        with urllib.request.urlopen(url) as response:
            with gzip.GzipFile(fileobj=response) as uncompressed:
                file_content = uncompressed.read()
        # write to file in binary mode 'wb'
        with open(out_file, 'wb') as f:
            f.write(file_content)
        return 0
    except Exception as e:
        print(e)
        return 1
Call the function with retval=download_file(url) to capture the return code.
For Python 3.8, here is my code, written on 08/05/2020:
import re
from urllib import request
import gzip
import shutil

url1 = "https://www.destinationlighting.com/feed/sitemap_items1.xml.gz"
file_name1 = re.split(pattern='/', string=url1)[-1]
r1 = request.urlretrieve(url=url1, filename=file_name1)
txt1 = re.split(pattern=r'\.', string=file_name1)[0] + ".txt"
with gzip.open(file_name1, 'rb') as f_in:
    with open(txt1, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
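As a self-contained round-trip check of the shutil.copyfileobj extraction step, the sketch below first writes a small .gz file and then extracts it exactly as above. The file names are placeholders created with tempfile so it runs without the network download.

```python
import gzip
import os
import shutil
import tempfile

# Placeholder paths (hypothetical names, created in a temp directory).
workdir = tempfile.mkdtemp()
gz_path = os.path.join(workdir, "sitemap_items1.xml.gz")
out_path = os.path.join(workdir, "sitemap_items1.txt")

# Stand-in for the downloaded archive.
with gzip.open(gz_path, 'wb') as f:
    f.write(b"<urlset></urlset>")

# Extract the archive: copyfileobj streams in fixed-size chunks,
# so the whole decompressed file never has to fit in memory at once.
with gzip.open(gz_path, 'rb') as f_in:
    with open(out_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

with open(out_path, 'rb') as f:
    print(f.read())
```

The chunked copy is the main advantage of this answer over ones that call .read() on the whole archive.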
Note: the technical posts on this site are licensed under CC BY-SA 4.0; if you need to reproduce them, please cite this site's URL or the original address. For any questions contact: yoyou2525@163.com.