
Extract image position from .docx file using python-docx

I'm trying to get the image index from a .docx file using the python-docx library. I'm able to extract the image name, height and width, but not the index where the image sits in the Word file.

import docx
doc = docx.Document(filename)
for s in doc.inline_shapes:
    print(s.height.cm, s.width.cm, s._inline.graphic.graphicData.pic.nvPicPr.cNvPr.name)

Output:

21.228  15.920 IMG_20160910_220903848.jpg

In fact, I would like to know whether there is a simpler way to get the image name, the way s.height.cm gave me the height in cm. My primary requirement is to know where the image is in the document, because I need to extract the image, do some work on it, and then put it back in the same location.

This operation is not directly supported by the API.

However, if you're willing to dig into the internals a bit and use the underlying lxml API, it's possible.

The general approach would be to access the ImagePart instance corresponding to the picture you want to inspect and modify, then read and write the ._blob attribute (which holds the image file as bytes).

This specimen XML might be helpful: http://python-docx.readthedocs.io/en/latest/dev/analysis/features/shapes/picture.html#specimen-xml

From the inline shape containing the picture, you get the <a:blip> element with this:

blip = inline_shape._inline.graphic.graphicData.pic.blipFill.blip

The relationship id (r:id generally, but r:embed in this case) is available at:

rId = blip.embed

Then you can get the image part from the document part:

document_part = document.part
image_part = document_part.related_parts[rId]

And then the binary image is available for read and write on ._blob.

If you write a new blob, it will replace the prior image when saved.
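
Put together, the steps above could look roughly like the following. This is a sketch, not an official python-docx API: `transform_images` and its `transform` callback are hypothetical names, and it relies on the same private attributes (`_inline`, `_blob`) mentioned above, so it may break between library versions. It expects an already opened Document, e.g. `docx.Document(filename)`.

```python
def transform_images(document, transform):
    """Run each embedded inline picture's bytes through `transform`.

    `document` is an opened python-docx Document; `transform` is a
    hypothetical callback taking bytes and returning bytes.
    Returns the number of distinct image parts that were rewritten.
    """
    seen = set()
    for inline_shape in document.inline_shapes:
        pic = inline_shape._inline.graphic.graphicData.pic
        if pic is None:
            continue  # inline shape that is not a picture (e.g. a chart)
        rId = pic.blipFill.blip.embed
        if rId is None or rId in seen:
            continue  # linked (not embedded) picture, or image already handled
        seen.add(rId)
        image_part = document.part.related_parts[rId]
        # Private attribute: holds the image file as bytes.
        image_part._blob = transform(image_part._blob)
    return len(seen)
```

Something like `transform_images(document, my_filter)` followed by `document.save(...)` would then write the modified images back at their original positions, since only the bytes behind each relationship change.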

You probably want to get it working with a single image and get a feel for it before scaling up to multiple images in a single document.

There might be one or two image characteristics that are cached, so you might not get all the finer points working until you save and reload the file, so just be alert for that.

Not for the faint of heart, as you can see, but it should work if you want it badly enough and can trace through the code a bit :)

You can also inspect paragraphs with a simple loop and check which ones contain an image (for example, whether a paragraph's XML contains "graphicData"); those paragraphs are the image containers. You can do the same with runs:

from docx import Document

image_paragraphs = []
doc = Document(path_to_docx)
for par in doc.paragraphs:
    if 'graphicData' in par._p.xml:
        image_paragraphs.append(par)

Then, if you unzip the docx file, the images are in the "word/media" folder, usually in the same order as they appear in the image_paragraphs list (the order is not guaranteed, so check it for your file). On every paragraph element you have many options for changing it. If you want to extract an image, process it, and then insert it in the same place, then:

paragraph.clear()  # remove the paragraph's existing content, including the old picture
run = paragraph.add_run('your description, if needed')
run.add_picture(path_to_pic, width, height)  # width and height are optional
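
If you do go the unzip route, the order of files inside the archive is not something I would rely on. A .docx is a zip archive, so the standard library can read the picture references in the order they appear in the body and resolve them to file names. A sketch, assuming the main part is at word/document.xml with its relationships in word/_rels/document.xml.rels (the usual layout); `images_in_document_order` is a hypothetical helper name.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def images_in_document_order(docx_file):
    """Return (rId, path inside the zip) for each embedded picture, in body order.

    Assumes the conventional part names word/document.xml and
    word/_rels/document.xml.rels.
    """
    with zipfile.ZipFile(docx_file) as zf:
        rels = ET.fromstring(zf.read("word/_rels/document.xml.rels"))
        body = ET.fromstring(zf.read("word/document.xml"))

    # Map relationship id -> target, e.g. "rId4" -> "media/image1.png".
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.iter("{%s}Relationship" % REL_NS)
    }

    result = []
    for blip in body.iter("{%s}blip" % A_NS):  # iter() walks in document order
        rId = blip.get("{%s}embed" % R_NS)
        target = targets.get(rId)
        if target is None:
            continue  # linked picture or dangling reference
        if target.startswith("/"):
            path = target.lstrip("/")  # absolute within the package
        else:
            path = posixpath.normpath(posixpath.join("word", target))
        result.append((rId, path))
    return result
```

The function accepts a file path or a binary file-like object, since `zipfile.ZipFile` takes either.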

So, I've never really written any answers here, but I think this might be the solution to your problem. With this little bit of code you can see the positions of your images among all the paragraphs. Hope it helps.

import docx

doc = docx.Document(filename)

paraGr = []             
index = []

for i, par in enumerate(doc.paragraphs):
    paraGr.append(par.text)            # text of every paragraph
    if 'graphicData' in par._p.xml:    # this paragraph holds a drawing
        index.append(i)                # remember its position
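
Building on the loop above, a paragraph index can be paired with the relationship ids of the pictures inside it, which gives a usable notion of "position". This is only a sketch: `image_positions` is a made-up helper name, and it assumes the serialized XML uses the usual `r:` prefix for `r:embed`, which is what python-docx normally emits. It takes an already opened Document, e.g. `docx.Document(filename)`.

```python
import re

# Assumption: pictures are referenced as r:embed="rIdN" in the paragraph XML.
EMBED_RE = re.compile(r'\br:embed="([^"]+)"')


def image_positions(doc):
    """Return (paragraph_index, rId) pairs in document order."""
    positions = []
    for i, par in enumerate(doc.paragraphs):
        # One paragraph may hold several pictures, so collect every match.
        for rId in EMBED_RE.findall(par._p.xml):
            positions.append((i, rId))
    return positions
```

Each rId can then be looked up in `doc.part.related_parts` to reach the image bytes. Note that `doc.paragraphs` only covers top-level body paragraphs, so pictures inside table cells are not found this way.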

If you are using Python 3:

pip install python-docx

import docx
doc = docx.Document(document_path)
P = []
I = []
for i, par in enumerate(doc.paragraphs):
    P.append(par.text)
    if 'graphicData' in par._p.xml:
        I.append(i)
print(I)

# prints the list of paragraph indexes that contain an image (Image_Reference)
