
Efficiently extract single file from .tar-archive

I have a .tgz file with a size of 2GB.

I want to extract only one .txt file, with a size of 2KB, from the .tgz file.

I have the following code:

import tarfile
from contextlib import closing

with closing(tarfile.open("myfile.tgz")) as tar:
    subdir_and_files = [
        tarinfo for tarinfo in tar.getmembers()
        if tarinfo.name.startswith("myfile/first/second/text.txt")
        ]
    print(subdir_and_files)
    tar.extractall(members=subdir_and_files)

The problem is that it takes at least a minute before I get the extracted file. It seems that extractall reads through the entire archive but saves only the file I asked for.

Is there a more efficient way to achieve this?

No.

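The tar format is not well suited to extracting single files quickly: it has no central index, so a reader has to walk the archive from the start until it reaches the entry it wants. In most cases this is made worse because the tar file sits inside a compressed stream, which has to be decompressed sequentially up to that point. I'd suggest 7z instead.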

Yes, kind of.

If you know that there is only one file with that name, or if you only want the first one, you could abort the extraction process after the first hit.

For example:

scan the thing completely.

$ time tar tf /var/log/apache2/old/2016.tar.xz 
2016/
2016/access.log-20161023
2016/access.log-20160724
2016/ssl_access.log-20160711
2016/error.log-20160815
(...)
2016/error.log-20160918
2016/ssl_request.log-20160814
2016/access.log-20161017
2016/access.log-20160516
time: Real 0m1.5s  User 0m1.4s  System 0m0.2s

scan the thing from memory (second run, output discarded)

$ time tar tf /var/log/apache2/old/2016.tar.xz  > /dev/null 
time: Real 0m1.3s  User 0m1.2s  System 0m0.2s

abort after first file

$ time tar tf /var/log/apache2/old/2016.tar.xz  | head -n1 
2016/
time: Real 0m0.0s  User 0m0.0s  System 0m0.0s

abort after three files

$ time tar tf /var/log/apache2/old/2016.tar.xz  | head -n3 
2016/
2016/access.log-20161023
2016/access.log-20160724
time: Real 0m0.0s  User 0m0.0s  System 0m0.0s

abort after some file in the "middle"

$ time tar xf /var/log/apache2/old/2016.tar.xz  2016/access.log-20160724  | head -n1 
time: Real 0m0.9s  User 0m0.9s  System 0m0.1s

abort after some file at the "bottom"

$ time tar xf /var/log/apache2/old/2016.tar.xz  2016/access.log-20160516  | head -n1 
time: Real 0m1.1s  User 0m1.1s  System 0m0.2s

I'm showing you here that if you close the output pipe (standard output) of GNU tar by exiting after the first line (head -n1), the tar process dies as well.

You can see that reading the whole archive takes more time than aborting after some file close to the "bottom" of the archive. You can also see that aborting the read after encountering a file at the top takes significantly less time.

I would not do this if I could decide the format of the archive.


Soo...

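Instead of the list-comprehension thing that Python people love so much, iterate over the archive one member at a time and break as soon as you have encountered your desired result, instead of expanding all the files into a list. Note that tar.getmembers() itself reads the whole archive before it returns, so use whatever gives you one file at a time in that library: iterate over the TarFile object directly (for tarinfo in tar:) or call tar.next() repeatedly.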
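A minimal sketch of that loop, assuming Python's standard tarfile module and an exact name match. The function name extract_first_match and its dest_dir parameter are made up for illustration; they are not part of the original code.

```python
import os
import tarfile


def extract_first_match(archive_path, member_name, dest_dir="."):
    """Extract the first member called member_name, then stop reading.

    Returns the path of the extracted file, or None if nothing matched.
    (Hypothetical helper, not part of the tarfile module.)
    """
    with tarfile.open(archive_path) as tar:
        # Iterating over the TarFile object reads one header at a time.
        # tar.getmembers() would scan the whole archive before returning.
        for tarinfo in tar:
            if tarinfo.name == member_name:
                tar.extract(tarinfo, path=dest_dir)
                # Returning here leaves the rest of the archive unread.
                return os.path.join(dest_dir, tarinfo.name)
    return None
```

With the archive from the question this would be called as extract_first_match("myfile.tgz", "myfile/first/second/text.txt"). How much time it saves depends on where the file sits: everything in front of it still has to be decompressed and skipped, so a file near the end of the archive gains very little.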
