How to extract text from a text shape within a Group Shape in PowerPoint, using python-pptx.
How to extract text from PowerPoint text boxes, in their order within the presentation, using python-pptx.
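My PowerPoint slides consist of text boxes, some of which sit inside group shapes. When I extract data from them, the text does not come out in order: sometimes a text box at the end of the presentation is extracted first, sometimes one from the middle, and so on.
The following code gets the text from the text boxes and handles group objects.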
    from pptx import Presentation
    from pptx.enum.shapes import MSO_SHAPE_TYPE

    text_list = []
    for eachfile in files:  # files: paths to the .pptx files, defined elsewhere
        prs = Presentation(eachfile)
        textrun = []
        # ---Only on text-boxes outside group elements---
        for slide in prs.slides:
            for shape in slide.shapes:
                if hasattr(shape, "text"):
                    print(shape.text)
                    textrun.append(shape.text)
            # ---Only operate on group shapes---
            group_shapes = [
                shp for shp in slide.shapes
                if shp.shape_type == MSO_SHAPE_TYPE.GROUP
            ]
            for group_shape in group_shapes:
                for shape in group_shape.shapes:
                    if shape.has_text_frame:
                        print(shape.text)
                        textrun.append(shape.text)
        new_list = " ".join(textrun)
        text_list.append(new_list)
    print(text_list)
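I want to filter some of the extracted data based on the order in which it appears in the slides. On what basis does the function decide the order? What should I do to fix this?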
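Steve's comment is quite right; the shapes returned by:
    for shape in slide.shapes:
        ...
are in the document order of the underlying XML, which is also what establishes the z-order. Z-order is the "stacking" order, as if each shape were on a separate transparent sheet (layer), with the first shape returned at the bottom and each subsequent shape added to the top of the stack (overlapping any shapes beneath it).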
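I believe what you're after here is left-to-right, top-to-bottom order. You'll need to write your own code, using shape.left and shape.top, to sort the shapes into that order.
Something like this might do the trick: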
    from pptx.enum.shapes import MSO_SHAPE_TYPE

    def iter_textframed_shapes(shapes):
        """Generate shape objects in *shapes* that can contain text.

        Shape objects are generated in document order (z-order), bottom to top.
        """
        for shape in shapes:
            # ---recurse on group shapes---
            if shape.shape_type == MSO_SHAPE_TYPE.GROUP:
                yield from iter_textframed_shapes(shape.shapes)
                continue
            # ---otherwise, treat shape as a "leaf" shape---
            if shape.has_text_frame:
                yield shape

    # ---`slide` is one slide taken from prs.slides---
    textable_shapes = list(iter_textframed_shapes(slide.shapes))
    # ---sort top-to-bottom, then left-to-right for equal tops---
    ordered_textable_shapes = sorted(
        textable_shapes, key=lambda shape: (shape.top, shape.left)
    )
    for shape in ordered_textable_shapes:
        print(shape.text)
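One caveat with a plain (shape.top, shape.left) key: two boxes that look like they sit on the same row but differ in top by even one EMU are ordered purely by top, so the right-hand box can come out first. A possible refinement, sketched below, groups shapes into rows using a tolerance before sorting each row left to right. The `Box` tuple is a stand-in for real shapes, and `reading_order` and `row_tolerance` are hypothetical names with an arbitrary quarter-inch default, not part of python-pptx.

```python
from collections import namedtuple

# Stand-in for a python-pptx shape: only the attributes the sort needs.
Box = namedtuple("Box", "text top left")

EMU_PER_INCH = 914400  # python-pptx reports positions in English Metric Units


def reading_order(shapes, row_tolerance=EMU_PER_INCH // 4):
    """Return shapes top-to-bottom, and left-to-right within each row.

    A shape joins the current row when its top is within *row_tolerance*
    of the top of the first shape in that row.
    """
    rows = []
    for shape in sorted(shapes, key=lambda s: s.top):
        if rows and shape.top - rows[-1][0].top <= row_tolerance:
            rows[-1].append(shape)
        else:
            rows.append([shape])
    return [s for row in rows for s in sorted(row, key=lambda s: s.left)]


boxes = [
    Box("B", top=1000, left=5_000_000),  # right-hand box, marginally higher
    Box("A", top=1200, left=100_000),    # left-hand box on the same visual row
    Box("C", top=2_000_000, left=0),     # next row down
]
naive = [b.text for b in sorted(boxes, key=lambda s: (s.top, s.left))]
grouped = [b.text for b in reading_order(boxes)]
```

With real slides, the same function could be called on the list produced by iter_textframed_shapes; the tolerance would need tuning to the layout in question.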