简体   繁体   中英

Unzipping selected files from a zip file

I have a zip file with internal folder structure as:

CODE
`-- CODE
    `-- CODE
        `-- CODE
            |-- 2019
            |   |-- file1.txt
            |   `-- file2.txt
            |-- 2020
            |   `-- file3.txt
            `-- 2021
                |-- file4.txt
                `-- file5.txt

And I want to unzip the files in folder structure as given below:

CODE
|-- 2019
|   |-- file1.txt
|   `-- file2.txt
|-- 2020
|   `-- file3.txt
`-- 2021
    |-- file4.txt
    `-- file5.txt

I could hard code it, however, since it is a repeating request, can I programmatically handle this to unzip only folders which have files in them.

My current code is:

def unzipfiles(incoming_path):
    for path,subdirs,files in os.walk(incoming_path):
        for name in files:
            if(name.endswith('.zip')):
                with zipfile.ZipFile(os.path.join(incoming_path,name), 'r') as zip_ref:
                    for file in zip_ref.namelist():
                        out_path=os.path.join(incoming_path,file)
                        out_path=out_path.replace('CODE/','')
                        if(out_path[:-1]!=incoming_path):
                            zip_ref.extract(file,out_path)

However, it is not working correctly, and creating more folders than present in zip file.

This code works for me.

def removeEmptyFolders(path, removeRoot=True):
  if not os.path.isdir(path):
    return
  files = os.listdir(path)
  if len(files):
    for f in files:
      fullpath = os.path.join(path, f)
      if os.path.isdir(fullpath):
        removeEmptyFolders(fullpath)
  files = os.listdir(path)
  if len(files) == 0 and removeRoot:
    os.rmdir(path)

The solution that I use, is mapping the full path of the files, to a relative shorter name. For the solution I will take the zip structure as provided by the OP.

import os
import re
import pathlib
import shutil
import zipfile
from pprint import pprint

if __name__ == '__main__':
    toplevel = os.path.join('files')
    new_structure = dict()

    # Let's just extract everything
    with zipfile.ZipFile('CODE.zip', 'r') as zip_file:

        for zip_info in zip_file.infolist():
            path = pathlib.PurePath(zip_info.filename)

            # This writes the data from the old file to a new file.
            if str(path.parent) in new_structure:
                source = zip_file.open(zip_info)
                target = open(os.path.join(new_structure[str(path.parent)], path.name), "wb")

                with source, target:
                    shutil.copyfileobj(source, target)

            # Create the new folder structure mapping, based on the year name.
            # The matches are based on numbers in this example, but can be specified.
            if re.match('\d+', path.name):  
                new_structure[str(path)] = os.path.join(toplevel, path.name)
                os.makedirs(new_structure[str(path)], exist_ok=True)

    pprint(new_structure)

The output ( pprint ), shows the remapping structure:

{'CODE\\CODE\\CODE\\CODE\\2019': 'files\\2019',
 'CODE\\CODE\\CODE\\CODE\\2020': 'files\\2020',
 'CODE\\CODE\\CODE\\CODE\\2021': 'files\\2021'}

The output is a new folder with the following structure:

files
|-- 2019
|   |-- file1.txt
|   `-- file2.txt
|-- 2020
|   `-- file3.txt
`-- 2021
    |-- file4.txt
    `-- file5.txt

Notes

There are two interesting points to make:

  1. Regex pattern matching is used to determine the file paths '\d+' , which simply accepts a list of numbers, if you want to be more precise you can use \d{4} to exactly match four digits.

  2. This method only assumes one lower level, in other words, multiple nested files will not be unpacked properly. For this the line if str(path.parent) in new_structure: has to be changed to take into account multiple parents path.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM