Writing a decompressed file fetched from a web server to disk

I can get a file that has content-encoding set to gzip.

So does that mean that the server is storing it as a compressed file, or is it also true for files stored as compressed zip or 7z files?

And if so (where durl is a zip file):

>>> durl = 'https://db.tt/Kq0byWzW'
>>> dresp = requests.get(durl, allow_redirects=True, stream=True)
>>> dresp.headers['content-encoding']
'gzip'

>>> r = requests.get(durl, stream=True)
>>> data = r.raw.read(decode_content=True)

But data comes out empty, while I want to extract the zip file to disk on the fly!

You need the content from the requests response to write it. Confirmed working:

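import requests

durl = 'https://db.tt/Kq0byWzW'
dresp = requests.get(durl, allow_redirects=True, stream=True)
print(dresp.headers.get('content-encoding'))

# dresp.text is the decoded body; the with block closes the file afterwards.
with open('test.html', 'w', encoding='utf-8') as f:
    f.write(dresp.text)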

First of all, durl is not a zip file; it is a Dropbox landing page. What you are looking at is HTML that is being sent using gzip encoding. If you were to decode the data from the raw socket using gzip, you would simply get the HTML. So the use of raw is really just hiding the fact that you accidentally got a different file than the one you expected.

Based on https://plus.google.com/u/0/100262946444188999467/posts/VsxftxQnRam where you ask:

Does anyone have any idea about writing a compressed file directly to disk in a decompressed state?

I take it you are really trying to fetch a zip and decompress it directly to a directory without first storing it. To do this you need to use https://docs.python.org/2/library/zipfile.html

At this point, though, the problem becomes that the response from requests isn't actually seekable, which zipfile requires in order to work (one of the first things it will do is seek to the end of the file to determine how long it is).
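
The seekability requirement can be reproduced offline. The following is a minimal sketch, not the behaviour of requests itself: `ForwardOnly` is a hypothetical stand-in for a streamed response body that can only be read forwards, and the archive is built in memory for illustration.

```python
import io
import zipfile


class ForwardOnly(io.RawIOBase):
    """Hypothetical read-only stream that cannot seek, like a socket body."""

    def __init__(self, data):
        super().__init__()
        self._inner = io.BytesIO(data)

    def readable(self):
        return True

    def readinto(self, b):
        chunk = self._inner.read(len(b))
        b[:len(chunk)] = chunk
        return len(chunk)


# Build a tiny zip archive in memory to stand in for the download.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", "hello world")
payload = buf.getvalue()

# Opening the archive straight from the forward-only stream fails,
# because zipfile tries to seek to the end of the file first.
try:
    zipfile.ZipFile(ForwardOnly(payload))
    stream_ok = True
except (zipfile.BadZipFile, OSError):
    stream_ok = False

# The same bytes in a seekable buffer open without trouble.
with zipfile.ZipFile(io.BytesIO(payload)) as archive:
    buffered_names = archive.namelist()

print(stream_ok, buffered_names)
```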

To get around this you need to wrap the response in a file-like object. Personally I would recommend using tempfile.SpooledTemporaryFile with a max size set. This way your code would switch to writing things to disk if the file was bigger than you expected.

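import requests
import tempfile
import zipfile

KB = 1<<10
MB = 1<<20

url = '...' # Set url to the download link.
path = '...' # Set path to the directory to extract into.

resp = requests.get(url, stream=True)
# Data stays in memory up to max_size, then spills over to a file on disk.
# Note: before Python 3.11, SpooledTemporaryFile does not implement seekable();
# if zipfile raises AttributeError, use tempfile.TemporaryFile() instead.
with tempfile.SpooledTemporaryFile(max_size=500*MB) as tmp:
    for chunk in resp.iter_content(4*KB):
        tmp.write(chunk)
    archive = zipfile.ZipFile(tmp)
    archive.extractall(path)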

Same code using io.BytesIO:

import io

resp = requests.get(url, stream=True)
# Buffer the whole download in memory so that zipfile can seek in it.
tmp = io.BytesIO()
for chunk in resp.iter_content(4*KB):
    tmp.write(chunk)
archive = zipfile.ZipFile(tmp)
archive.extractall(path)
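
The buffering approach above can be tried without a network connection. This is a minimal sketch under stated assumptions: `extract_zip_chunks` is a hypothetical helper, and a generator over an in-memory archive stands in for `resp.iter_content(...)`.

```python
import io
import os
import tempfile
import zipfile


def extract_zip_chunks(chunks, dest):
    """Buffer an iterable of byte chunks, then extract it as a zip into dest.

    Hypothetical helper: `chunks` stands in for resp.iter_content(...).
    """
    tmp = io.BytesIO()
    for chunk in chunks:
        tmp.write(chunk)
    with zipfile.ZipFile(tmp) as archive:
        names = archive.namelist()
        archive.extractall(dest)
    return names


# Build a small zip in memory to stand in for the downloaded bytes.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", "hello world")
payload = buf.getvalue()

# Feed the archive in 16-byte chunks, as a streamed response would arrive.
chunks = (payload[i:i + 16] for i in range(0, len(payload), 16))
dest = tempfile.mkdtemp()
extracted = extract_zip_chunks(chunks, dest)
print(extracted, os.listdir(dest))
```

Nothing is extracted until the last chunk has arrived, because zipfile reads the archive's directory from the end of the file.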

You have to differentiate between content-encoding (not to be confused with transfer-encoding) and content-type.

The gist of it is that content-type is the media type (the real file type) of the resource you are trying to get, and content-encoding is any kind of modification applied to it before sending it to the client.

So let's assume you'd like to get a resource named "foo.txt". It will probably have a content-type of text/plain. In addition to that, the data can be modified when it is sent over the wire. This is the content-encoding. So, with the above example, you can have a content-type of text/plain and a content-encoding of gzip. This means that before the server sends the file out onto the wire, it will compress it using gzip on the fly. So the only bytes which traverse the net are the gzip-compressed ones, not the raw bytes of the original file (foo.txt).

It is the job of the client to process these headers accordingly.
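
As a rough illustration of that division of labour, the sketch below plays both roles with the standard gzip module. The payload is made up, and no HTTP is involved; it only shows that the bytes on the wire differ from the resource itself.

```python
import gzip

# The resource as stored on the server (made-up content for foo.txt).
original = b"hello from foo.txt\n" * 50

# What a server does for "Content-Encoding: gzip": compress on the fly.
wire_bytes = gzip.compress(original)

# What the client does on receipt: undo the encoding to get the resource back.
decoded = gzip.decompress(wire_bytes)

print(len(original), len(wire_bytes), decoded == original)
```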

Now, I am not 100% sure whether requests or the underlying Python libraries do this, but chances are they do. If not, Python ships with a gzip library by default, so you could do it on your own without a problem.

With the above in mind, to respond to your question: No, having a "content-encoding" of gzip does not mean that the remote resource is a zip file. The field containing that information is content-type (based on your question this probably has a value of application/zip or application/x-7z-compressed, depending on the actual compression algorithm used).
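
One way to keep the two fields apart in code is to read them separately. This is a minimal sketch: `describe` is a hypothetical helper, and both header dictionaries are invented examples rather than responses captured from a real server.

```python
def describe(headers):
    """Return (media_type, encoding) from a mapping of response headers."""
    lowered = {key.lower(): value for key, value in headers.items()}
    # Drop parameters such as "; charset=utf-8" from the media type.
    media_type = lowered.get("content-type", "").split(";")[0].strip().lower()
    # "identity" is the token meaning that no content encoding was applied.
    encoding = lowered.get("content-encoding", "identity").strip().lower()
    return media_type, encoding


# Invented examples: an HTML landing page sent gzip-encoded, and a zip archive.
landing_page = {"Content-Type": "text/html; charset=utf-8",
                "Content-Encoding": "gzip"}
archive = {"Content-Type": "application/zip"}

print(describe(landing_page))
print(describe(archive))
```

Only the first element of each pair says what kind of file the resource is.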

If you cannot determine the real file type based on the content-type field (e.g. if it is application/octet-stream), you could just save the file to disk and open it up with a hex editor. In the case of a 7z file you should see the byte sequence 37 7a bc af 27 1c somewhere, most likely at the beginning of the file or at EOF-112 bytes. In the case of a gzip file, it should be 1f 8b at the beginning of the file.
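
The same check can be done in code instead of a hex editor. The sketch below is a minimal version that only looks at the start of the data; `sniff` is a hypothetical helper, and the zip signature (50 4b 03 04, the start of a non-empty zip archive) is an addition not mentioned above.

```python
import gzip
import io
import zipfile

# Leading byte sequences ("magic numbers") for a few container formats.
SIGNATURES = {
    bytes.fromhex("377abcaf271c"): "7z",
    bytes.fromhex("1f8b"): "gzip",
    b"PK\x03\x04": "zip",  # assumption: non-empty archive
}


def sniff(data):
    """Return a format name if data starts with a known signature, else None."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


# Sample inputs built in memory for illustration.
gzip_sample = gzip.compress(b"some text")
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w") as zf:
    zf.writestr("a.txt", "a")
zip_sample = zip_buffer.getvalue()
html_sample = b"<!DOCTYPE html><html></html>"

print(sniff(gzip_sample), sniff(zip_sample), sniff(html_sample))
```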

Given that you have gzip in the content-encoding field: if you get a 7z file, you can be certain that requests has parsed content-encoding and properly decoded it for you. If you get a gzip file, it could mean two things: either requests has not decoded anything, or the file is indeed a gzip file, as it could be a gzip file sent with the gzip encoding, which would mean that it's doubly compressed. This would not make any sense, but depending on the server it could still happen.

You could simply try to run gunzip on the console and see what you get.
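
A rough Python equivalent of that experiment is sketched below. `try_gunzip` is a hypothetical helper that returns None when the data is not valid gzip, and the doubly compressed payload is constructed here purely for illustration.

```python
import gzip
import zlib


def try_gunzip(data):
    """Return the gunzipped bytes, or None if data is not valid gzip."""
    try:
        return gzip.decompress(data)
    except (OSError, EOFError, zlib.error):
        return None


payload = b"the real file contents"

# Simulate a gzip file that was additionally sent with gzip encoding.
doubly_compressed = gzip.compress(gzip.compress(payload))

once = try_gunzip(doubly_compressed)   # still a gzip file
twice = try_gunzip(once)               # the real contents
not_gzip = try_gunzip(b"<html></html>")

print(once[:2] == b"\x1f\x8b", twice == payload, not_gzip)
```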
