
Python: "Compressed file ended before the end-of-stream marker was reached", but the file is not corrupted

I made a simple requests script that downloads a file from a server:


r = requests.get("https:.../index_en.txt.lzma")
index_en= open('C:\...\index_en.txt.lzma','wb')
index_en.write(r.content)
index_en.close


When I now extract the file manually in the directory with 7zip, everything is fine and the file decompresses as normal.

I tried two ways to do it in a Python program, but since the file ends with .lzma I guess the following one is the better approach:

import lzma 
with open('C:\...\index_en.txt.lzma') as compressed:
    print(compressed.readline)
    with lzma.LZMAFile(compressed) as uncompressed:
        for line in uncompressed:
            print(line)

This one gives me the error "Compressed file ended before the end-of-stream marker was reached" at the line with the for loop.

The second way I tried was with py7zr, because doing it by hand with 7zip worked fine:

with py7zr.SevenZipFile("C:\...\index_en.txt.lzma", 'w') as archive:
    archive.extract(path="C:\...\Json")

This one gives me the error "OSError: [Errno 22] Invalid argument" at the "with py7zr..." line.

I really don't understand where the problem is. Why does it work by hand but not in Python? Thanks.

You didn't close your file, so data stuck in user-mode buffers isn't visible on disk until the file is cleaned up at some undetermined future point (which may not happen at all, and may not happen until the program exits even if it does). Because of this, any attempt to access the file by any means other than the single handle you wrote to will not see the unflushed data, which makes it appear as if the file was truncated, producing the error you observe.
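A quick way to see the buffering effect for yourself (a temporary file stands in for the real download path):

```python
import os
import tempfile

# Create an empty temp file to write to (stand-in for the real path).
fd, path = tempfile.mkstemp()
os.close(fd)

f = open(path, "w")
f.write("hello")                    # small write sits in the user-mode buffer
with open(path) as reader:
    print(repr(reader.read()))      # '' -- nothing has reached the disk yet
f.close()                           # close() flushes the buffer
with open(path) as reader:
    print(repr(reader.read()))      # 'hello' -- visible after the flush
os.remove(path)
```

A second handle to the same path sees nothing until close() (or flush()) pushes the buffered bytes out; that is exactly why 7zip, reading the file from outside, saw a truncated stream.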

The minimal solution is to actually call close, changing index_en.close to index_en.close(). But practically speaking, you should use with statements for all files (and locks, and socket-like things, and all other resources that require cleanup) whenever possible, so that even when an exception occurs the file is definitely closed; it's most important for files you're writing to (where data might not get flushed to disk without it), but even for files opened for reading, in pathological cases you can end up hitting the open-file-handle limit.

Rewriting your first block of code to be completely safe gets you:

with requests.get("https:.../index_en.txt.lzma") as r, open(r'C:\...\index_en.txt.lzma','wb') as index_en:
    index_en.write(r.content)

Note: requests.Response objects are also context managers, so I added it to the with to ensure the underlying connection is released back to the pool promptly. I also prefixed your local path with an r to make it a raw string; on Windows, with backslashes in the path, you always want to do this, so that a file or directory beginning with a character that Python recognizes as a string literal escape doesn't get corrupted (e.g. "C:\foo" is actually "C:<form feed>oo", containing neither a backslash nor an f).
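You can check the escape behavior directly (values follow CPython's string-literal rules):

```python
s = "C:\foo"       # "\f" is the form-feed escape, so the backslash is gone
print(len(s))      # 5: C, :, form feed, o, o
print("\f" in s)   # True -- the path now contains a form-feed character

r = r"C:\foo"      # raw string: the backslash is kept literally
print(len(r))      # 6: C, :, \, f, o, o
print("\\" in r)   # True
```

Forward slashes ("C:/foo") also work in most Windows APIs and sidestep the issue entirely.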

You could even optimize it a bit, in case the file is large, by streaming the data into the file (requiring mostly fixed memory overhead, tied to the buffer size of the underlying connection) rather than fetching eagerly (requiring memory proportionate to file size):

# stream=True means underlying file is opened without being immediately
# read into memory
with requests.get("https:.../index_en.txt.lzma", stream=True) as r, open(r'C:\...\index_en.txt.lzma','wb') as index_en:
    # iter_content(None) produces an iterator of chunks of data (of whatever size
    # is available in a single system call)
    # Changing to writelines means the iterator is consumed and written
    # as the data arrives
    index_en.writelines(r.iter_content(None))

Controlling the requests.get with a with statement is more important here (as stream=True mode means the underlying socket isn't consumed and freed immediately).

Also note that print(compressed.readline) is doing nothing useful (because you didn't call readline). If there is some line of text in the response prior to the raw LZMA data, you failed to skip it. If there is no such garbage line, and you'd called readline properly (with print(compressed.readline())), it would have broken decompression, because the file pointer would then have skipped the first few (or many) bytes of the file, landing at some essentially random offset.

Lastly,

with py7zr.SevenZipFile("C:\...\index_en.txt.lzma", 'w') as archive:
    archive.extract(path="C:\...\Json")

is wrong because you passed it a mode indicating you're opening it for write, when you're clearly attempting to read from it; either omit the 'w' or change it to 'r'.
