Extract paragraph text in Python

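How can I use Python to search a Word document and extract the paragraph text that follows a matching paragraph heading, i.e. "1.2 Summary of Broadspectrum Offer"?

For example, in the sample document below I basically want to get the following text: "A summary of our Offer to deliver the Scope of Work as outlined in the tender documents is provided below. Please refer to the various terms and conditions of our Offer as detailed herein. Please also find the cost breakdown"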

1.  Executive Summary

1.1 Summary of Services
Energy Savings (Carbon Emissions and Intensity Reduction)
Upgrade Economy Cycle on Level 2,5,6,7 & 8, replace Chilled Water Valves on Level 6 & 8 and install lighting controls on L5 & 6..

1.2 Summary of Broadspectrum Offer

A summary of our Offer to deliver the Scope of Work as outlined in the tender documents is provided below. Please refer to the various terms and conditions of our Offer as detailed herein.
Please also find the cost breakdown 

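Note that the heading numbers change from document to document and I don't want to depend on them, so I want to rely on searching for the text in the heading.

So far I can search the document, but that is only a start.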

filename1 = "North Sydney TE SP30062590-1 HVAC - Project Offer -  Rev1.docx"

from docx import Document

document = Document(filename1)
for paragraph in document.paragraphs:
    if 'Summary' in paragraph.text:
        print(paragraph.text)

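Here is a preliminary solution (pending a reply to my comment on your post above). It does not yet exclude other paragraphs that come after the Summary of Broadspectrum Offer section. If you need that, you would most likely want a small regex match to determine whether you have reached another heading such as 1.3 (and so on), and stop collecting if so. Let me know whether this is a requirement.

Edit: converted the print() call from a list comprehension into a standard for loop, in response to Anton vBR's comment below.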

from docx import Document

document = Document("North Sydney TE SP30062590-1 HVAC - Project Offer -  Rev1.docx")

# Find the index of every paragraph containing `Summary of Broadspectrum Offer` and store it
ind = [i for i, para in enumerate(document.paragraphs) if 'Summary of Broadspectrum Offer' in para.text]
# Print the text of every paragraph that comes after the first match found above
if ind:
    for i, para in enumerate(document.paragraphs):
        if i > ind[0]:
            print(para.text)

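# Pre-edit one-line version: same output, but it uses a list comprehension only for its side effect
[print(para.text) for i, para in enumerate(document.paragraphs) if ind and i > ind[0]]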

>> A summary of our Offer to deliver the Scope of Work as outlined in the tender documents is provided below. 
Please refer to the various terms and conditions of our Offer as detailed herein.
Please also find the cost breakdown 
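
The stopping condition mentioned above (halting once the next numbered heading such as 1.3 is reached) could be sketched roughly as follows. This is only an illustration: `HEADING_RE` and `extract_section` are hypothetical names, and the pattern assumes the heading numbers are typed into the paragraph text. If Word generates the numbers through automatic numbering, they do not appear in `paragraph.text` and this check would not fire. A body paragraph that happens to begin with something like "3.5 kW" would also be mistaken for a heading.

```python
import re

# Assumption: a numbered heading starts with "1." or "1.2" / "1.2.3", followed by text.
HEADING_RE = re.compile(r"^\s*(?:\d+\.\d+(?:\.\d+)*|\d+\.)\s+\S")


def extract_section(paragraph_texts, heading_text):
    """Collect the non-empty paragraphs that follow the first paragraph
    containing `heading_text`, stopping at the next numbered heading."""
    collected = []
    in_section = False
    for text in paragraph_texts:
        if not in_section:
            # Keep skipping until the target heading is found.
            if heading_text in text:
                in_section = True
            continue
        if HEADING_RE.match(text):
            # The next numbered heading ends the section.
            break
        if text.strip():
            collected.append(text.strip())
    return collected
```

With python-docx, the function could be fed `[p.text for p in document.paragraphs]` and the search string `'Summary of Broadspectrum Offer'`, so the match depends on the heading text and not on its number.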

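Also, here is another post that may help with an alternative approach, namely using the paragraph metadata to detect the heading type: Extract heading text from word doc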
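
A minimal sketch of that metadata-based idea, assuming the headings in the document use Word's built-in styles, whose names python-docx exposes as `paragraph.style.name` (for example "Heading 1" or "Heading 2"). The function name `extract_section_by_style` is made up for illustration. If the document formats its headings manually (bold text in the Normal style, say), this check would not detect them.

```python
def extract_section_by_style(paragraphs, heading_text):
    """Collect the non-empty paragraphs that follow the heading containing
    `heading_text`, stopping at the next paragraph with a heading style.

    `paragraphs` is any iterable of objects with `.text` and `.style.name`,
    such as `document.paragraphs` from python-docx.
    """
    collected = []
    in_section = False
    for para in paragraphs:
        # Assumption: headings use the built-in "Heading N" styles.
        is_heading = para.style.name.startswith("Heading")
        if not in_section:
            if is_heading and heading_text in para.text:
                in_section = True
            continue
        if is_heading:
            # Any later heading ends the section.
            break
        if para.text.strip():
            collected.append(para.text.strip())
    return collected
```

Called with `document.paragraphs` and `'Summary of Broadspectrum Offer'`, this would return only the body text under that heading, whatever number the heading carries.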
