How to extract many files from multiple 7z using Python?

I need to extract 700k jpg files that are dispersed among 50 7z files. I have a txt file that has one row for each file I need. Each row contains the target 7z file, followed by the location and file name inside it.

This is what the txt file looks like:

A20000to22000.7z, A20000to22000/rows/A21673.Lo1sign.jpg
B20000to22000.7z, B20000to22000/rows/B21673.Lo1sign.jpg

I am currently able to extract files with Python, but only from one 7z at a time. I use this command to do that:

7zz e A0000to22000.7z @f1.txt

This is taking way too long, though. Is there any way to edit the command or use another approach so I can extract many different files from many different 7z files at once?

Updated Answer

With the new information that there are lots of files to retrieve from each archive, a modified approach is needed.

First we must generate a list of the files needed from each 7z archive, then process those lists in parallel. This command should generate the lists:

awk -F, '{sub("7z","txt",$1); print $2 > $1}' joblist.txt

That should make a file called A20000to22000.txt that contains all the files to be extracted from the archive A20000to22000.7z, and similarly B20000to22000.txt for B20000to22000.7z.
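Since the question asks about Python, the same grouping step could be sketched roughly as below. This is only an illustration, not part of the original answer: the function name write_list_files and its out_dir parameter are made up here, and it assumes every row has the form "archive.7z, path/inside/archive" with no commas in the archive name.

```python
from collections import defaultdict
from pathlib import Path


def write_list_files(joblist_path, out_dir="."):
    """Group joblist rows by archive and write one list file per archive.

    Hypothetical helper: A20000to22000.7z -> A20000to22000.txt, and so on.
    Returns a dict mapping each archive name to the list file written for it.
    """
    members = defaultdict(list)
    with open(joblist_path, encoding="utf-8") as fh:
        for line in fh:
            # Split on the first comma only; strip the blank after it.
            archive, sep, member = line.partition(",")
            archive, member = archive.strip(), member.strip()
            if not sep or not archive or not member:
                continue  # skip blank or malformed rows
            members[archive].append(member)

    written = {}
    for archive, names in members.items():
        list_file = Path(out_dir) / (Path(archive).stem + ".txt")
        list_file.write_text("\n".join(names) + "\n", encoding="utf-8")
        written[archive] = list_file
    return written
```

Calling write_list_files("joblist.txt") in the directory that holds the archives should leave one .txt list file next to each archive, in the same layout the awk command produces.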

Don't proceed past here until the files ending in .txt look correct.

Now we need to process the .txt files in parallel with GNU Parallel. That should look something like this:

parallel --dry-run 7zz e {.}.7z @{} ::: *to*.txt 

I used *to*.txt in order to avoid processing the original joblist.txt.

If that command looks correct, remove --dry-run and run for real.
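If you would rather drive this step from Python than from GNU Parallel, a rough equivalent might look like the sketch below. It is an assumption-laden illustration rather than the answer's method: build_command, extract_all, and the workers and run parameters are names invented here. The only thing taken from the answer is the command shape, 7zz e ARCHIVE.7z @LISTFILE.txt, with one list file per archive.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(list_file, exe="7zz"):
    """Build the command for one list file: NAME.txt -> 7zz e NAME.7z @NAME.txt."""
    list_file = Path(list_file)
    archive = list_file.with_suffix(".7z")
    return [exe, "e", str(archive), f"@{list_file}"]


def extract_all(list_files, workers=4, dry_run=True, run=subprocess.run):
    """Run one extraction per list file, several at a time.

    With dry_run=True the commands are only printed (like --dry-run) and an
    empty list is returned. Otherwise returns (command, return code) pairs.
    """
    commands = [build_command(p) for p in list_files]
    if dry_run:
        for cmd in commands:
            print(" ".join(cmd))
        return []
    # Threads are enough here: each worker just waits on an external process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        finished = list(pool.map(run, commands))
    return [(cmd, proc.returncode) for cmd, proc in zip(commands, finished)]
```

A call such as extract_all(sorted(Path(".").glob("*to*.txt"))) prints the commands first; passing dry_run=False runs them. The workers value of 4 is an arbitrary starting point, and the best setting likely depends on how many cores and how much disk bandwidth you have.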

Original Answer

Assuming joblist.txt looks like this:

A20000to22000.7z, A20000to22000/rows/A21673.Lo1sign.jpg
B20000to22000.7z, B20000to22000/rows/B21673.Lo1sign.jpg

and that each row corresponds to needing to run a command like:

7zz e A20000to22000.7z A20000to22000/rows/A21673.Lo1sign.jpg

you can do that in parallel with GNU Parallel like this:

parallel --dry-run --colsep , 7zz e {1} {2} :::: joblist.txt

If it looks right, remove --dry-run and run for real.


Note that this is done in the terminal/shell and without Python, so it falls under the "another approach" you mentioned.
