How to extract text from PowerPoint text boxes in order within a presentation using python-pptx

Question

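My PowerPoint slides consist of text boxes, some of which are inside group shapes. When I extract data from them, the text does not come out in order: sometimes a text box at the end of the ppt is extracted first, sometimes one from the middle, and so on.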

The following code gets the text from the text boxes and handles group shapes.

from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE

# `files` is assumed to be a list of .pptx paths defined elsewhere
text_list = []
for eachfile in files:
    prs = Presentation(eachfile)
    textrun = []
    for slide in prs.slides:
        # ---Only on text-boxes outside group elements---
        for shape in slide.shapes:
            if hasattr(shape, "text"):
                print(shape.text)
                textrun.append(shape.text)

        # ---Only operate on group shapes---
        group_shapes = [
            shp for shp in slide.shapes if shp.shape_type == MSO_SHAPE_TYPE.GROUP
        ]
        for group_shape in group_shapes:
            for shape in group_shape.shapes:
                if shape.has_text_frame:
                    print(shape.text)
                    textrun.append(shape.text)
    # ---one space-joined string per presentation---
    new_list = " ".join(textrun)
    text_list.append(new_list)

print(text_list)

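I want to filter some of the extracted data based on the order in which it appears on the slide. What does the function use to decide the order, and what should I do to fix this?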

Answer 1

Steve's comment is quite right; the shapes returned from:

for shape in slide.shapes:
    ...

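are in the document order of the underlying XML, which is also what establishes z-order. Z-order is the "stacking" order, as if each shape were on a separate transparent sheet (layer), with the first shape returned at the bottom and each subsequent shape added to the top of the stack (overlapping any shapes beneath it).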

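I think what you're after here is left-to-right, top-to-bottom order. You'll need to write your own code using shape.left and shape.top to sort the shapes into that order.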

Something like this might work:

from pptx.enum.shapes import MSO_SHAPE_TYPE


def iter_textframed_shapes(shapes):
    """Generate shape objects in *shapes* that can contain text.

    Shape objects are generated in document order (z-order), bottom to top.
    """
    for shape in shapes:
        # ---recurse on group shapes---
        if shape.shape_type == MSO_SHAPE_TYPE.GROUP:
            yield from iter_textframed_shapes(shape.shapes)
            continue

        # ---otherwise, treat shape as a "leaf" shape---
        if shape.has_text_frame:
            yield shape


# ---`slide` is one slide of a loaded presentation, e.g. from `prs.slides`---
textable_shapes = list(iter_textframed_shapes(slide.shapes))
# ---sort top-to-bottom first, then left-to-right---
ordered_textable_shapes = sorted(
    textable_shapes, key=lambda shape: (shape.top, shape.left)
)

for shape in ordered_textable_shapes:
    print(shape.text)
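
One thing to watch with an exact (shape.top, shape.left) key: two boxes that look like they sit on the same row often differ by a few EMU in their top value, so the right-hand box can sort ahead of the left-hand one. Below is a rough sketch of one way to tolerate that, not anything built into python-pptx. The function name reading_order and the row_tolerance parameter are my own inventions, the quarter-inch default is an arbitrary guess, and the demo uses SimpleNamespace stand-ins rather than real shapes so it runs without a .pptx file. It assumes every shape has a non-None top and left.

```python
from types import SimpleNamespace

EMU_PER_INCH = 914400  # python-pptx reports shape positions in EMU


def reading_order(shapes, row_tolerance=EMU_PER_INCH // 4):
    """Return shapes sorted top-to-bottom, then left-to-right within a row.

    A shape joins the current row when its `top` is within `row_tolerance`
    of the first shape in that row. (Hypothetical helper, not a pptx API.)
    """
    by_top = sorted(shapes, key=lambda s: (s.top, s.left))
    rows = []
    for shape in by_top:
        if rows and shape.top - rows[-1][0].top <= row_tolerance:
            rows[-1].append(shape)  # close enough: same visual row
        else:
            rows.append([shape])  # start a new row
    # within each row, order by horizontal position
    return [s for row in rows for s in sorted(row, key=lambda s: s.left)]


# Stand-ins exposing the same `top`, `left` and `text` attributes as shapes.
boxes = [
    SimpleNamespace(text="right", top=1_000_100, left=5_000_000),
    SimpleNamespace(text="left", top=1_000_900, left=1_000_000),
    SimpleNamespace(text="bottom", top=4_000_000, left=1_000_000),
]

plain_texts = [s.text for s in sorted(boxes, key=lambda s: (s.top, s.left))]
ordered_texts = [s.text for s in reading_order(boxes)]
print(plain_texts)    # exact sort puts "right" before "left"
print(ordered_texts)  # tolerant sort reads the row left to right
```

With real slides you would pass in list(iter_textframed_shapes(slide.shapes)) instead of the stand-ins. The right tolerance depends on how the deck is laid out, so treat the default as a starting point only.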

Solution 1: 2, accepted, 2018-08-24 21:34:49
