Extract specific file extensions from multiple 7-zip files

I have a RAR file and a ZIP file. Each of them contains a folder, and inside that folder there are several 7-zip (.7z) files. Every 7z holds multiple files that share the same extension but have different names.

RAR or ZIP file
  |___folder
        |_____Multiple 7z
                  |_____Multiple files with same extension and different name

I want to extract just the ones I need from thousands of files. I need the files whose names include a certain substring: if the name of a compressed file includes '[!]', '(U)' or '(J)', that is the criterion for extracting it.

I can extract the folder without problems, so I have this structure:

folder
   |_____Multiple 7z
                |_____Multiple files with same extension and different name

I'm in a Windows environment, but I have Cygwin installed. How can I extract the files I need painlessly, ideally with a single command line?

Update

There are some refinements to the question:

  • The inner 7z files and the files inside them can have spaces in their names.
  • Some 7z files contain just one file, and it doesn't meet the given criteria. Being the only possible file, it has to be extracted too.

Solution

Thanks to everyone. The bash solution was the one that helped me out. I wasn't able to test the Python3 solutions because I had problems installing libraries with pip. I don't use Python, so I'll have to study and overcome the errors I hit with those solutions. For now, I've found a suitable answer.

This solution is based on bash, grep and awk; it works on Cygwin and on Ubuntu.

Since you need to search for (X) [!].ext files first and fall back to (X).ext files only if there are none, I don't think a single expression can handle this logic.

The solution needs some if/else conditional logic to test the list of files inside each archive and decide which files to extract.
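
To make the decision order concrete, here is a minimal Python sketch of the same three-step fallback, separate from the bash script in this answer. The function name select_files is hypothetical, and the regular expression is only one reading of the "(X).ext" rule (a single capital letter in parentheses right before the extension).

```python
import re

def select_files(names):
    """Pick the names to extract from one archive's file list."""
    # First choice: names containing the literal substring "[!]".
    marked = [n for n in names if "[!]" in n]
    if marked:
        return marked
    # Second choice: a single capital letter in parentheses
    # immediately before the extension, e.g. "(J).txt".
    plain = [n for n in names if re.search(r"\([A-Z]\)\.", n)]
    if plain:
        return plain
    # Last resort: an archive holding exactly one file yields that file.
    if len(names) == 1:
        return list(names)
    return []
```

Each archive's name list goes through the function once, so the choice between "[!] files" and "(X) files" is made per archive rather than per file.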

Here is the initial structure inside the zip/rar archive I tested my script on (I made a script to prepare this structure):

folder
├── 7z_1.7z
│   ├── (E).txt
│   ├── (J) [!].txt
│   ├── (J).txt
│   ├── (U) [!].txt
│   └── (U).txt
├── 7z_2.7z
│   ├── (J) [b1].txt
│   ├── (J) [b2].txt
│   ├── (J) [o1].txt
│   └── (J).txt
├── 7z_3.7z
│   ├── (E) [!].txt
│   ├── (J).txt
│   └── (U).txt
└── 7z 4.7z
    └── test.txt

The output is this:

output
├── 7z_1.7z           # This is a folder, not an archive
│   ├── (J) [!].txt   # Here we extracted only files with [!]
│   └── (U) [!].txt
├── 7z_2.7z
│   └── (J).txt       # Here there are no [!] files, so we extracted (J)
├── 7z_3.7z
│   └── (E) [!].txt   # We had here both [!] and (J), extracted only file with [!]
└── 7z 4.7z
    └── test.txt      # We had only one file here, extracted it

And this is the script to do the extraction:

#!/bin/bash

# Remove the output (if it's left from previous runs).
rm -rf output
mkdir -p output

# Unzip the zip archive.
unzip data.zip -d output
# For rar use
#  unrar x data.rar output
# OR
#  7z x -ooutput data.rar

for archive in output/folder/*.7z
do
  # See https://stackoverflow.com/questions/7148604
  # Get the list of file names, remove the extra output of "7z l"
  list=$(7z l "$archive" | awk '
      /----/ {p = ++p % 2; next}
      $NF == "Name" {pos = index($0,"Name")}
      p {print substr($0,pos)}
  ')
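  # Get the list of files containing the literal substring [!]
  # (-F makes grep treat the pattern as a fixed string, not a bracket expression).
  extract_list=$(echo "$list" | grep -F '[!]')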
  if [[ -z $extract_list ]]; then
    # If we don't have files with [!], then look for ([A-Z]) pattern
    # to get files with single letter in brackets.
    extract_list=$(echo "$list" | grep "([A-Z])\.")
  fi
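  if [[ -z $extract_list ]]; then
    # If the archive holds exactly one file, extract it.
    # $list is a newline-separated string, not an array, so count its lines.
    if [[ -n $list && $(echo "$list" | grep -c '') -eq 1 ]]; then
      extract_list=$list
    fi
  fi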
  if [[ ! -z $extract_list ]]; then
    # If we have files to extract, then do the extraction.
    # Output path is output/7zip_archive_name/
    out_path=output/$(basename "$archive")
    mkdir -p "$out_path"
    echo "$extract_list" | xargs -I {} 7z x -o"$out_path" "$archive" {}
  fi
done

The basic idea is to go over the 7zip archives and get the list of files in each of them using the 7z l command (list files).

The output of the command is quite verbose, so we use awk to clean it up and get the list of file names.
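
For readers who find the awk program terse, the following Python sketch mirrors its logic: it toggles a flag on the dashed separator lines, remembers the column where the "Name" header starts, and keeps everything from that column on for the rows in between. The function name parse_7z_listing is hypothetical, and the sketch assumes the table layout that the awk code relies on.

```python
def parse_7z_listing(text):
    """Return the file names from the table printed by `7z l`."""
    names = []
    inside = False     # True between the two dashed separator lines
    name_col = None    # column index where the "Name" header starts
    for line in text.splitlines():
        if "----" in line:
            # Each separator line flips the flag, like p = ++p % 2 in awk.
            inside = not inside
            continue
        fields = line.split()
        if name_col is None and fields and fields[-1] == "Name":
            name_col = line.index("Name")
            continue
        if inside and name_col is not None:
            # Slice by column so names containing spaces stay intact.
            names.append(line[name_col:])
    return names
```

Slicing by column instead of splitting on whitespace is what keeps names such as "(J) [!].txt" in one piece.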

After that we filter this list using grep to get either a list of [!] files or a list of (X) files. Then we just pass this list to 7zip to extract the files we need.
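
The final step, handing the selected names to 7zip, could look like this in Python. It is only a sketch: build_extract_command is a hypothetical helper, the "output" default mirrors the directory used in the script, and unlike the xargs loop it passes all names in a single 7z call.

```python
import os

def build_extract_command(archive, names, out_root="output"):
    """Build the argument list for `7z x` on one archive."""
    # Same layout as the script: output/<archive file name>/
    out_path = os.path.join(out_root, os.path.basename(archive))
    # A list (rather than one shell string) keeps names with spaces intact.
    return ["7z", "x", "-o" + out_path, archive] + list(names)
```

The resulting list can be passed to subprocess.run(cmd, check=True); because no shell is involved, spaces in archive or file names need no extra quoting.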

What about using this command line:

7z e c:\myDir\*.7z -oc:\outDir "*(U)*.ext" "*(J)*.ext" "*[!]*.ext" -y

Where:

  • myDir is your unzip folder
  • outDir is your output directory
  • ext is your file extension

The -y option forces overwriting in case the same file name appears in different archives.
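
To preview which names such a pattern set would select, here is a small Python sketch. It assumes that only * and ? act as wildcards, so that [!] is matched literally (that is my understanding of 7-Zip's wildcard rules, so check the 7-Zip documentation before relying on it). Both function names are hypothetical.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a pattern in which only * and ? are special into a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # any run of characters
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(parts), re.DOTALL)

def matches_any(name, patterns):
    """True if the whole name matches at least one pattern."""
    return any(wildcard_to_regex(p).fullmatch(name) for p in patterns)
```

Note that a flat pattern list like this selects every (U), (J) and [!] file at once; it does not express the "prefer [!], otherwise fall back" priority described in the question.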

This is a more or less final version after several tries. The previous one was not useful, so I'm removing it instead of appending. Read to the end, since not everything may be needed for the final solution.

To the topic: I would use Python. For a one-time task it can be overkill, but in any other case you can log every step for later investigation, use regexes, and orchestrate commands by providing their input and processing their output, every time. All of these are quite easy in Python, provided you have it installed.

Now I'll describe how to get the environment configured. Not all of it is mandatory, but I went through these steps while trying to install things, and the description of the process may be useful in itself.

I have MinGW, the 32-bit version. It is not mandatory for extracting 7zip, however. Once it is installed, go to C:\MinGW\bin and run mingw-get.exe:

  • In Basic Setup I have msys-base installed (right click, mark for installation, then Apply Changes from the Installation menu). That way I have bash, sed, grep, and many more.
  • In All Packages there is mingw32-libarchive with dll as class. Since the Python libarchive package is just a wrapper, you need this dll to actually have a binary to wrap.

Examples are for Python 3. I'm using the 32-bit version. You can fetch it from their home page. I installed it in the default directory, which is an odd location, so my advice is to install it in the root of your disk, like MinGW.

Other things: conemu is much better than the default console.

Installing packages in Python: pip is used for that. From your console go to the Python home; there is a Scripts subdirectory there. For me it is c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\Scripts. You can search with, for instance, pip search archive, and install with pip install libarchive-c:

> pip.exe install libarchive-c
Collecting libarchive-c
  Downloading libarchive_c-2.7-py2.py3-none-any.whl
Installing collected packages: libarchive-c
Successfully installed libarchive-c-2.7

After cd .., call python, and the new library can be used / imported:

>>> import libarchive
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\__init__.py", line 1, in <module>
    from .entry import ArchiveEntry
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\entry.py", line 6, in <module>
    from . import ffi
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\ffi.py", line 27, in <module>
    libarchive = ctypes.cdll.LoadLibrary(libarchive_path)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\ctypes\__init__.py", line 426, in LoadLibrary
   return self._dlltype(name)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\ctypes\__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
TypeError: LoadLibrary() argument 1 must be str, not None

So it fails. I tried to fix that, but it then failed with this:

>>> import libarchive
read format "cab" is not supported
read format "7zip" is not supported
read format "rar" is not supported
read format "lha" is not supported
read filter "uu" is not supported
read filter "lzop" is not supported
read filter "grzip" is not supported
read filter "bzip2" is not supported
read filter "rpm" is not supported
read filter "xz" is not supported
read filter "none" is not supported
read filter "compress" is not supported
read filter "all" is not supported
read filter "lzma" is not supported
read filter "lzip" is not supported
read filter "lrzip" is not supported
read filter "gzip" is not supported
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\__init__.py", line 1, in <module>
    from .entry import ArchiveEntry
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\entry.py", line 6, in <module>
    from . import ffi
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\ffi.py", line 167, in <module>
    c_int, check_int)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\ffi.py", line 92, in ffi
    f = getattr(libarchive, 'archive_'+name)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\ctypes\__init__.py", line 361, in __getattr__
    func = self.__getitem__(name)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\ctypes\__init__.py", line 366, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: function 'archive_read_open_filename_w' not found
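
The first traceback ends in "LoadLibrary() argument 1 must be str, not None", which suggests that the wrapper could not locate the libarchive shared library at all. A quick way to check that before importing is sketched below. The helper name locate_libarchive is hypothetical, and the LIBARCHIVE environment variable as an override is an assumption about how libarchive-c looks the library up; verify it against the package's README.

```python
import ctypes.util
import os

def locate_libarchive():
    """Return a path or name for the libarchive shared library, or None."""
    # Assumption: an explicit LIBARCHIVE setting takes precedence;
    # otherwise fall back to the platform's normal library search.
    return os.environ.get("LIBARCHIVE") or ctypes.util.find_library("archive")

path = locate_libarchive()
if path is None:
    print("libarchive shared library not found")
else:
    print("libarchive candidate:", path)
```

If this prints "not found", the import will hit the same TypeError, so the DLL location has to be sorted out first. The second traceback is a different problem: a library was loaded, but it does not export the functions the wrapper expects.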

I tried using the set command to provide the information directly, but that failed too. So I moved to pylzma, for which mingw is not needed. The pip install failed:

> pip.exe install pylzma
Collecting pylzma
  Downloading pylzma-0.4.9.tar.gz (115kB)
    100% |--------------------------------| 122kB 1.3MB/s
Installing collected packages: pylzma
  Running setup.py install for pylzma ... error
    Complete output from command c:\users\texxas\appdata\local\programs\python\python36-32\python.exe -u -c "import setuptools, tokenize;__file__='C:\\Users\\texxas\\AppData\\Local\\Temp\\pip-build-99t_zgmz\\pylzma\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\texxas\AppData\Local\Temp\pip-ffe3nbwk-record\install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_py
    creating build
    creating build\lib.win32-3.6
    copying py7zlib.py -> build\lib.win32-3.6
    running build_ext
    adding support for multithreaded compression
    building 'pylzma' extension
    error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

Failed again. But that is an easy one: I installed Visual Studio Build Tools 2015, and that worked. I have sevenzip installed, so I created a sample archive. So finally I can start python and do:

from py7zlib import Archive7z
f = open(r"C:\Users\texxas\Desktop\try.7z", 'rb')
a = Archive7z(f)
a.filenames

And got an empty list. Looking closer gives a better understanding: empty files are not considered by pylzma, just to make you aware of that. So after putting one character into each of my sample files, the last line gives:

>>> a.filenames
['try/a/test.txt', 'try/a/test1.txt', 'try/a/test2.txt', 'try/a/test3.txt', 'try/a/test4.txt', 'try/a/test5.txt', 'try/a/test6.txt', 'try/a/test7.txt', 'try/b/test.txt', 'try/b/test1.txt', 'try/b/test2.txt', 'try/b/test3.txt', 'try/b/test4.txt', 'try/b/test5.txt', 'try/b/test6.txt', 'try/b/test7.txt', 'try/c/test.txt', 'try/c/test1.txt', 'try/c/test11.txt', 'try/c/test2.txt', 'try/c/test3.txt', 'try/c/test4.txt', 'try/c/test5.txt', 'try/c/test6.txt', 'try/c/test7.txt']

So the rest is a piece of cake. And actually this is a part of the original post:

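import os
import py7zlib

for folder, subfolders, files in os.walk('.'):
    for file in files:
        if file.endswith('.7z'):
            # A 7z archive: extract the members we need.
            archive_path = os.path.join(folder, file)
            try:
                with open(archive_path, 'rb') as f:
                    z = py7zlib.Archive7z(f)
                    for name in z.getnames():
                        if name.endswith('.py'):
                            # Read the member and write it under ./dest.
                            target = os.path.join('./dest', name)
                            os.makedirs(os.path.dirname(target), exist_ok=True)
                            with open(target, 'wb') as out:
                                out.write(z.getmember(name).read())
            except py7zlib.FormatError as e:
                print('file ' + archive_path)
                print(str(e))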

As a side note, Anaconda is a great tool, but a full install takes 500+ MB, so that is way too much.

Also let me share a fragment of the wmctrl.py tool from my GitHub:

cmd = 'wmctrl -ir ' + str(active.window) + \
      ' -e 0,' + str(stored.left) + ',' + str(stored.top) + ',' + str(stored.width) + ',' + str(stored.height)
print(cmd)
res = getoutput(cmd)

That way you can orchestrate different commands; here it is wmctrl. The result can then be processed like any other data.
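
As a generic illustration of that "run a command, then process its output" pattern, here is a minimal sketch using the standard subprocess module. The helper name run_and_collect is hypothetical; it is not taken from wmctrl.py.

```python
import subprocess
import sys

def run_and_collect(args):
    """Run a command and return (exit code, non-empty output lines)."""
    result = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # fold error output into the same stream
        universal_newlines=True,    # decode bytes to str
    )
    lines = [line for line in result.stdout.splitlines() if line.strip()]
    return result.returncode, lines

# Example: run a child Python process and look at what it printed.
code, lines = run_and_collect([sys.executable, "-c", "print('first'); print('second')"])
print(code, lines)
```

The same helper could wrap a call such as ["7z", "l", "archive.7z"], with the returned lines then filtered in plain Python.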

You state in the question's bounty footer that it is OK to use Linux, and I don't use Windows, sorry about that. I am using Python3, and you have to be in a Linux environment (I will try to test this on Windows as soon as I can).

Archive structure

datadir.rar
          |
          datadir/
                 |
                 zip1.7z
                 zip2.7z
                 zip3.7z
                 zip4.7z
                 zip5.7z

Extracted structure

extracted/
├── zip1
│   ├── (E) [!].txt
│   ├── (J) [!].txt
│   └── (U) [!].txt
├── zip2
│   ├── (E) [!].txt
│   ├── (J) [!].txt
│   └── (U) [!].txt
├── zip3
│   ├── (J) [!].txt
│   └── (U) [!].txt
└── zip5
    ├── (J).txt
    └── (U).txt

Here is how I did it.

import libarchive.public
import os, os.path
from os.path import basename
import errno
import rarfile

#========== FILE UTILS =================

#Make directories
def mkdir_p(path):
    try:
        os.makedirs(path)
    except OSError as exc: # Python >2.5
        if exc.errno == errno.EEXIST and os.path.isdir(path):
            pass
        else: raise

#Open "path" for writing, creating any parent directories as needed.
def safe_open_w(path):
    mkdir_p(os.path.dirname(path))
    return open(path, 'wb')

#========== RAR TOOLS ==================

# List
def rar_list(rar_archive):
    with rarfile.RarFile(rar_archive) as rf:
        return rf.namelist()

# extract
def rar_extract(rar_archive, filename, path):
    with rarfile.RarFile(rar_archive) as rf:
        rf.extract(filename,path)

# extract-all
def rar_extract_all(rar_archive, path):
    with rarfile.RarFile(rar_archive) as rf:
        rf.extractall(path)

#========= 7ZIP TOOLS ==================

# List
def zip7_list(zip7file):
    filelist = []
    with open(zip7file, 'rb') as f:
        for entry in libarchive.public.memory_pour(f.read()):
            filelist.append(entry.pathname.decode("utf-8"))
    return filelist

# extract
def zip7_extract(zip7file, filename, path):
    with open(zip7file, 'rb') as f:
        for entry in libarchive.public.memory_pour(f.read()):
            if entry.pathname.decode("utf-8") == filename:
                with safe_open_w(os.path.join(path, filename)) as q:
                    for block in entry.get_blocks():
                        q.write(block)
                break

# extract-all
def zip7_extract_all(zip7file, path):
    with open(zip7file, 'rb') as f:
        for entry in libarchive.public.memory_pour(f.read()):
            if os.path.isdir(entry.pathname.decode("utf-8")):
                continue
            with safe_open_w(os.path.join(path, entry.pathname.decode("utf-8"))) as q:
                for block in entry.get_blocks():
                    q.write(block)

#============ FILE FILTER =================

def exclamation_filter(filename):
    return ("[!]" in filename)

def optional_code_filter(filename):
    return not ("[" in filename)

def has_exclamation_files(filelist):
    for singlefile in filelist:
        if(exclamation_filter(singlefile)):
            return True
    return False

#============ MAIN PROGRAM ================

print("-------------------------")
print("Program Started")
print("-------------------------")

BIG_RAR = 'datadir.rar'
TEMP_DIR = 'temp'
EXTRACT_DIR = 'extracted'
newzip7filelist = []

#Extract big rar and get new file list
for zipfilepath in rar_list(BIG_RAR):
    rar_extract(BIG_RAR, zipfilepath, TEMP_DIR)
    newzip7filelist.append(os.path.join(TEMP_DIR, zipfilepath))

print("7z Files Extracted")
print("-------------------------")

for newzip7file in newzip7filelist:
    innerFiles = zip7_list(newzip7file)
    for singleFile in innerFiles:
        fileSelected = False
        if(has_exclamation_files(innerFiles)):
            if exclamation_filter(singleFile): fileSelected = True
        else:
            if optional_code_filter(singleFile): fileSelected = True
        if(fileSelected):
            print(singleFile)
            outputFile = os.path.join(EXTRACT_DIR, os.path.splitext(basename(newzip7file))[0])
            zip7_extract(newzip7file, singleFile, outputFile)

print("-------------------------")
print("Extraction Complete")
print("-------------------------")

Above the main program, I've got all the required functions ready. I didn't use all of them, but I kept them in case you need them.

I used several Python libraries with python3, but you only have to install libarchive and rarfile using pip; the others are built-in libraries.

And here is a copy of my source tree

Console output

This is the console output when you run this Python file:

-------------------------
Program Started
-------------------------
7z Files Extracted
-------------------------
(J) [!].txt
(U) [!].txt
(E) [!].txt
(J) [!].txt
(U) [!].txt
(E) [!].txt
(J) [!].txt
(U) [!].txt
(J).txt
(U).txt
-------------------------
Extraction Complete
-------------------------

Issues

The only issue I have faced so far is that some temporary files are generated at the program root. It doesn't affect the program in any way, but I'll try to fix that.

Edit

You have to run

sudo apt-get install libarchive-dev

to install the actual libarchive program. The Python library is just a wrapper around it. Take a look at the official documentation.
