
Python: Extracting specific files with pattern from tar.gz without extracting the complete file

I want to extract all files with the pattern *_sl_H* from many tar.gz files, without extracting all files from the archives.

I found the following snippet (from https://pymotw.com/2/tarfile/), but it does not support wildcards:

import tarfile
import os

os.mkdir('outdir')
t = tarfile.open('example.tar', 'r')
t.extractall('outdir', members=[t.getmember('README.txt')])
print(os.listdir('outdir'))

Does someone have an idea? Many thanks in advance.

Take a look at the TarFile.getmembers() method, which returns the members of the archive as a list of TarInfo objects. Once you have this list, you can test each member's name and extract only the ones that match.

import tarfile
import os

os.mkdir('outdir')
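# Mode 'r' detects compression automatically, so this also opens .tar.gz files.
with tarfile.open('example.tar', 'r') as t:
    for member in t.getmembers():
        # Extract only members whose name contains the wanted substring.
        if "_sl_H" in member.name:
            t.extract(member, "outdir")

print(os.listdir('outdir'))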
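
The question asks for a shell-style wildcard (*_sl_H*) rather than a plain substring test. A minimal sketch of that variant, assuming the standard-library fnmatch module is acceptable; the helper name extract_matching and its parameters are made up for illustration and are not part of tarfile:

```python
import fnmatch
import tarfile


def extract_matching(archive_path, pattern, outdir):
    """Extract only regular files whose name matches a shell-style pattern.

    Hypothetical helper for illustration. Returns the extracted names.
    """
    extracted = []
    # Mode 'r' detects gzip compression transparently, so .tar.gz works too.
    with tarfile.open(archive_path, 'r') as t:
        for member in t.getmembers():
            # fnmatchcase is case-sensitive on every platform; '*' also
            # matches '/', so members inside subdirectories are matched.
            if member.isfile() and fnmatch.fnmatchcase(member.name, pattern):
                t.extract(member, outdir)
                extracted.append(member.name)
    return extracted
```

Calling extract_matching('example.tar.gz', '*_sl_H*', 'outdir') would then write only the matching members below outdir, keeping their paths from the archive.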

You can extract all files matching your pattern from many tar archives as follows:

  1. Use glob to get a list of all the *.tar or *.gz files in a given folder.

  2. For each tar file, get a list of its members using the getmembers() function.

  3. Use a regular expression (or a simple if "xxx" in name test) to filter the required files.

  4. Pass this list of matching members to the members parameter of the extractall() function.

  5. Add exception handling to skip tar files that cannot be opened or read.

For example:

import tarfile
import glob
import re

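# search() matches anywhere in the name, so no leading or trailing .*? is needed.
reT = re.compile(r'_sl_H')

# Adjust the pattern (e.g. *.tar.gz) to match how your archives are named.
for tar_filename in glob.glob(r'\my_source_folder\*.tar'):
    try:
        t = tarfile.open(tar_filename, 'r')
    except (tarfile.TarError, IOError) as e:
        # tarfile.open raises tarfile.ReadError (a TarError) for unreadable archives.
        print(e)
    else:
        # Close the archive once the matching members have been extracted.
        with t:
            t.extractall('outdir', members=[m for m in t.getmembers() if reT.search(m.name)])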
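
Before extracting from many archives, it can help to check which members the filter would pick up. A rough sketch of such a dry run, assuming the archives are named *.tar or *.tar.gz; the helper list_matches and its parameters are hypothetical, and nothing is written to disk:

```python
import glob
import os
import tarfile


def list_matches(folder, needle='_sl_H'):
    """Map each readable archive in folder to its matching member names.

    Hypothetical helper for illustration only.
    """
    matches = {}
    # Assumption: archives are named *.tar or *.tar.gz inside folder.
    for path in sorted(glob.glob(os.path.join(folder, '*.tar*'))):
        try:
            with tarfile.open(path, 'r') as t:
                matches[path] = [n for n in t.getnames() if needle in n]
        except (tarfile.TarError, OSError) as e:
            # Corrupt or unreadable archive: report it and continue.
            print('skipping {}: {}'.format(path, e))
    return matches
```

Archives that fail to open are reported and left out of the returned dictionary, so one bad file does not stop the run.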

