Python 3.6: extracting text from PPT files

Question

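I am using textract and python-pptx to extract the text content of files, and that works well. Unfortunately, our clients also have .ppt files that need to be processed, and the server has neither MS Office nor Open Office installed, so I can't use comtypes to convert a .ppt file to another file type and extract the text from there.

Any suggestions for alternative approaches would be much appreciated.

I am running Python 3.6 on a 64-bit Windows machine.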

Answer 1

Convert the files here: https://convertio.co/ppt-pptx/ That will let you use them in your program.
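
In a folder that mixes both formats, one way to find out which files would need a conversion step is to look at the first bytes of each file. The following is only a minimal sketch: `sniff_container` and `needs_conversion` are hypothetical helper names, not part of any library. It relies on legacy Office files (.ppt, but also .doc and .xls) being OLE2 compound files and .pptx files being ZIP archives, so it identifies the container type only, not PowerPoint specifically.

```python
# Container signatures (first bytes of the file).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .ppt/.doc/.xls
ZIP_MAGIC = b"PK\x03\x04"                       # .pptx/.docx/.xlsx and other ZIPs


def sniff_container(path):
    """Return 'ole2', 'zip', or 'unknown' based on the file's first bytes."""
    with open(path, "rb") as handle:
        header = handle.read(8)
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def needs_conversion(path):
    """True if the file looks like a legacy binary Office file (e.g. .ppt)."""
    return sniff_container(path) == "ole2"
```

Checking the content rather than the extension also catches files that were renamed from .ppt to .pptx without actually being converted.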

Answer 2

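    import os
    from os.path import isfile, join, splitext

    from pptx import Presentation


    def getPptContent(path):
        """Return the text of every run in every text frame of a .pptx file."""
        prs = Presentation(path)
        text_runs = []
        for slide in prs.slides:
            for shape in slide.shapes:
                # Shapes such as pictures have no text frame; skip them.
                if not shape.has_text_frame:
                    continue
                for paragraph in shape.text_frame.paragraphs:
                    for run in paragraph.runs:
                        text_runs.append(run.text)
        return text_runs


    ppt_dir = "ppt_data"
    output_dir = join(ppt_dir, "output")
    # Create the output directory if it does not exist yet.
    os.makedirs(output_dir, exist_ok=True)

    # Regular, non-hidden files only. Note that python-pptx reads .pptx files,
    # not legacy binary .ppt files.
    corpus = [f for f in os.listdir(ppt_dir)
              if not f.startswith(".") and isfile(join(ppt_dir, f))]

    for filename in corpus:
        path = join(ppt_dir, filename)
        print(path)
        file_content = getPptContent(path)
        # Write the list of text runs to <name>.txt in the output directory.
        out_path = join(output_dir, splitext(filename)[0] + ".txt")
        with open(out_path, "w", encoding="utf-8") as f:
            f.write(str(file_content))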
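
If installing python-pptx on the server is not an option, the same kind of extraction can be approximated with the standard library alone, because a .pptx file is a ZIP archive of XML parts. This is only a rough sketch, not how python-pptx does it: `extract_pptx_text` is a hypothetical helper, slides are ordered by the number in their file name (which usually, but not necessarily, matches the presentation order), and it collects every DrawingML `<a:t>` element, so it also picks up text inside tables and grouped shapes. Like the code above, it does not read legacy .ppt files.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace; the text of each run is stored in an <a:t> element.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
SLIDE_NAME = re.compile(r"ppt/slides/slide(\d+)\.xml")


def extract_pptx_text(path):
    """Return a list of text runs from all slides, ordered by slide file number."""
    text_runs = []
    with zipfile.ZipFile(path) as archive:
        slides = []
        for name in archive.namelist():
            match = SLIDE_NAME.fullmatch(name)
            if match:
                slides.append((int(match.group(1)), name))
        for _, name in sorted(slides):
            root = ET.fromstring(archive.read(name))
            for node in root.iter(A_NS + "t"):
                if node.text:
                    text_runs.append(node.text)
    return text_runs
```

It can be dropped into the loop above in place of `getPptContent(path)`, since it returns the same kind of list of strings.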

2 answers

Answer 1
0 2017-07-30 16:47:26

Answer 2
0 2018-12-28 08:06:58
