
python docx2txt extracts images out of order

2022-05-18 09:36:56 · Tags: python, image, orders, docx2txt

I am using docx2txt to extract images from a docx file. The file has multiple images, and all of them are extracted, but their order is not the same as in the document. For example, it extracts the images as image1.png, image2.png, and image3.png, but image3.png is actually the topmost image in the docx, so it should be named image1.png. Is there any option to extract the images and name them in the order they appear in the docx?

1 Answer

I looked through the source code of the docx2txt library and couldn't find any code that renames image files. My guess is that the editor used to create the document names the images that way. I used Microsoft Word 2013 in all of my tests, and it always numbered the images according to their order in the document.

As far as I understand, docx files are created by zipping XML and media (image, video, etc.) files together. The software that creates the document, such as Microsoft Word, is probably what names the files in the zip. Maybe you are processing docx files created with a different version of Word or with other software. That software may give a newly added file the next number in the sequence instead of renumbering all the images when new media is added.
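If the file names inside the zip cannot be trusted, one possible workaround is to skip them and read the order from the document itself. The sketch below is not part of docx2txt; it uses only the Python standard library, and the function names `images_in_document_order` and `extract_images_in_order` are made up for illustration. It assumes the images are embedded in the main body as `a:blip` elements whose `r:embed` attribute points at an entry in `word/_rels/document.xml.rels`. Images in headers or footers, legacy VML pictures, and externally linked images are not handled.

```python
import os
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Standard OOXML namespace URIs.
NS_A = "http://schemas.openxmlformats.org/drawingml/2006/main"
NS_R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
NS_PKG = "http://schemas.openxmlformats.org/package/2006/relationships"


def images_in_document_order(docx):
    """Return media part names (e.g. 'word/media/image3.png') in the order
    they are first referenced in word/document.xml."""
    with zipfile.ZipFile(docx) as zf:
        # Map relationship ids (rId1, rId2, ...) to their target paths.
        rels = ET.fromstring(zf.read("word/_rels/document.xml.rels"))
        targets = {
            rel.get("Id"): rel.get("Target")
            for rel in rels.iter(f"{{{NS_PKG}}}Relationship")
            if rel.get("TargetMode") != "External"  # skip linked images
        }
        body = ET.fromstring(zf.read("word/document.xml"))

    ordered = []
    # Element.iter() walks the tree in document order.
    for blip in body.iter(f"{{{NS_A}}}blip"):
        target = targets.get(blip.get(f"{{{NS_R}}}embed"))
        if target is None:
            continue
        if target.startswith("/"):
            part = target.lstrip("/")  # absolute path within the package
        else:
            part = posixpath.normpath(posixpath.join("word", target))
        if part not in ordered:  # the same image may be reused
            ordered.append(part)
    return ordered


def extract_images_in_order(docx, out_dir):
    """Write the images to out_dir as image1.<ext>, image2.<ext>, ...
    numbered by position in the document; return the written paths."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    parts = images_in_document_order(docx)
    with zipfile.ZipFile(docx) as zf:
        for index, part in enumerate(parts, start=1):
            ext = posixpath.splitext(part)[1]
            out_path = os.path.join(out_dir, f"image{index}{ext}")
            with open(out_path, "wb") as fh:
                fh.write(zf.read(part))
            written.append(out_path)
    return written
```

Calling `extract_images_in_order("report.docx", "images")` (both names are placeholders) would write the topmost image as `images/image1.<ext>` regardless of what it is called inside the archive.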