
Read images from docx file with python-docx

I have a docx file that contains images. The relevant part of the unzipped document.xml is shown below. The image file is referenced by its id within the docx structure: rId5.

<w:p>
  <w:pPr>
    <w:framePr w:h="13450" w:wrap="notBeside" w:vAnchor="text" w:hAnchor="text" w:xAlign="center" w:y="1"/>
    <w:widowControl w:val="0"/>
    <w:jc w:val="center"/>
    <w:rPr>
      <w:sz w:val="2"/>
      <w:szCs w:val="2"/>
    </w:rPr>
  </w:pPr>
  <w:r>
    <w:pict>
      <v:shapetype id="_x0000_t75" coordsize="21600,21600" o:spt="75" o:preferrelative="t" path="m@4@5l@4@11@9@11@9@5xe" filled="f" stroked="f">
        <v:stroke joinstyle="miter"/>
        <v:formulas>
          <v:f eqn="if lineDrawn pixelLineWidth 0"/>
          <v:f eqn="sum @0 1 0"/>
          <v:f eqn="sum 0 0 @1"/>
          <v:f eqn="prod @2 1 2"/>
          <v:f eqn="prod @3 21600 pixelWidth"/>
          <v:f eqn="prod @3 21600 pixelHeight"/>
          <v:f eqn="sum @0 0 1"/>
          <v:f eqn="prod @6 1 2"/>
          <v:f eqn="prod @7 21600 pixelWidth"/>
          <v:f eqn="sum @8 21600 0"/>
          <v:f eqn="prod @7 21600 pixelHeight"/>
          <v:f eqn="sum @10 21600 0"/>
        </v:formulas>
        <v:path o:extrusionok="f" gradientshapeok="t" o:connecttype="rect"/>
        <o:lock v:ext="edit" aspectratio="t"/>
      </v:shapetype>
      <v:shape id="_x0000_s1026" type="#_x0000_t75" style="width:486pt;height:673pt;">
        <v:imagedata r:id="rId5" r:href="rId6"/>
      </v:shape>
    </w:pict>
  </w:r>
</w:p>

I tried to use the document.inline_shapes property to read the images, but the following prints 0:

import docx

PATH = "/home/amoe/test.docx"
doc = docx.Document(PATH)
print(len(doc.inline_shapes))

Is there any other way I can read this data? I can see that the image is contained within a run, but I can't see any way to use the API of the docx.text.Run class to access the image. The id of the imagedata element would be enough.

Refer to the python-docx 0.8.9 documentation:

Word documents have two layers, a text layer and a drawing layer. When a picture appears in the text layer it is called an inline picture. At the time of writing, python-docx only supports inline pictures.

I assume your pictures are in the drawing layer, so you can't read them with python-docx.

You can read this post: https://stackoverflow.com/a/27705408/8484506
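
Since the question says the id of the imagedata element would be enough, one possible workaround is to bypass python-docx and read the package directly with the standard library. The sketch below is not part of python-docx. The function name read_vml_images is hypothetical, and it assumes the main document part lives at word/document.xml with its relationships in word/_rels/document.xml.rels, which is the usual layout but not guaranteed. It collects every v:imagedata element, resolves its r:id through the relationships file, and returns the embedded bytes. Relationships marked as external (which r:href may point to) are skipped.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs used inside a .docx package.
VML_NS = "urn:schemas-microsoft-com:vml"
REL_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def read_vml_images(docx_path):
    """Return {rId: (package member name, image bytes)} for each v:imagedata.

    Hypothetical helper; assumes the conventional word/document.xml layout.
    """
    images = {}
    with zipfile.ZipFile(docx_path) as package:
        # Map relationship ids to their targets, skipping external links.
        rels_root = ET.fromstring(package.read("word/_rels/document.xml.rels"))
        targets = {
            rel.get("Id"): rel.get("Target")
            for rel in rels_root.iter(f"{{{PKG_REL_NS}}}Relationship")
            if rel.get("TargetMode") != "External"
        }

        body = ET.fromstring(package.read("word/document.xml"))
        for imagedata in body.iter(f"{{{VML_NS}}}imagedata"):
            r_id = imagedata.get(f"{{{REL_NS}}}id")
            if r_id not in targets:
                continue  # no r:id, or the id refers to an external file
            target = targets[r_id]
            if target.startswith("/"):
                # Absolute target: relative to the package root.
                member = target.lstrip("/")
            else:
                # Relative target: relative to the word/ folder.
                member = posixpath.normpath(posixpath.join("word", target))
            images[r_id] = (member, package.read(member))
    return images
```

Calling read_vml_images("/home/amoe/test.docx") on the file from the question should then return a dictionary keyed by ids such as rId5, provided the assumptions above hold for that file.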
