Extract paragraph text in Python
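How can I use Python to search a Word document for a matching paragraph heading, e.g. "1.2 Summary of Broadspectrum Offer", and then extract the paragraph text that follows it?
For example, in the sample document below, I basically want to get the following text: "A summary of our Offer to deliver the Scope of Work as outlined in the tender documents is provided below. Please refer to the various terms and conditions of our Offer as detailed herein. Please also find the cost breakdown"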
1. Executive Summary
1.1 Summary of Services
Energy Savings (Carbon Emissions and Intensity Reduction)
Upgrade Economy Cycle on Level 2,5,6,7 & 8, replace Chilled Water Valves on Level 6 & 8 and install lighting controls on L5 & 6..
1.2 Summary of Broadspectrum Offer
A summary of our Offer to deliver the Scope of Work as outlined in the tender documents is provided below. Please refer to the various terms and conditions of our Offer as detailed herein.
Please also find the cost breakdown
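Note that the heading numbers change from doc to doc and I don't want to depend on them, so I would rather rely on searching for the text in the heading.
So far I can search the document, but that is only a start.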
from docx import Document

filename1 = "North Sydney TE SP30062590-1 HVAC - Project Offer - Rev1.docx"
document = Document(filename1)

# Print every paragraph whose text contains the word 'Summary'
for paragraph in document.paragraphs:
    if 'Summary' in paragraph.text:
        print(paragraph.text)
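Here is a preliminary solution (pending a reply to my comment on your post above). It does not yet exclude the paragraphs that come after the Summary of Broadspectrum Offer section. If you need that, you will most likely want a small regex match to determine whether you have reached another heading section numbered 1.3 (and so on), and stop there if so. Let me know if this is a requirement.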
Edit: converted the print() from the list-comprehension approach to a standard for loop, in response to Anton vBR's comment below.
from docx import Document

document = Document("North Sydney TE SP30062590-1 HVAC - Project Offer - Rev1.docx")

# Find the index of every paragraph containing `Summary of Broadspectrum Offer`
ind = [i for i, para in enumerate(document.paragraphs) if 'Summary of Broadspectrum Offer' in para.text]

# Print the text of every paragraph that comes after the first match
if ind:
    for i, para in enumerate(document.paragraphs):
        if i > ind[0]:
            print(para.text)

# Earlier version, kept for reference; it printed from inside a list comprehension:
# [print(para.text) for i, para in enumerate(document.paragraphs) if ind and i > ind[0]]
>> A summary of our Offer to deliver the Scope of Work as outlined in the tender documents is provided below.
Please refer to the various terms and conditions of our Offer as detailed herein.
Please also find the cost breakdown
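The regex-based stopping condition mentioned in this answer could look roughly like the sketch below. It is not part of the original answer. It works on plain strings so it can be tried without a .docx file, and it assumes that the section numbers ("1.2", "1.3", ...) are typed into the heading text. If the numbers come from Word's automatic numbering they usually do not appear in paragraph.text, and this pattern will not find them. The name text_after_heading, the pattern HEADING_RE and the "1.3 Next Section" line are made up for illustration.

```python
import re

# Assumption: a numbered heading looks like "1.3 Some Title" or "2. Some Title".
# A body paragraph that starts with a number would also match, so treat this as a heuristic.
HEADING_RE = re.compile(r"^\d+(?:\.\d+)*\.?\s+\S")


def text_after_heading(paragraph_texts, heading_text):
    """Collect paragraph texts after the heading containing heading_text,
    stopping at the next line that looks like a numbered heading."""
    collected = []
    in_section = False
    for text in paragraph_texts:
        stripped = text.strip()
        if not in_section:
            # Match on the heading wording only, not on its number
            if heading_text in stripped:
                in_section = True
            continue
        if HEADING_RE.match(stripped):
            break  # reached the next numbered heading
        if stripped:
            collected.append(stripped)
    return collected


# Sample lines taken from the question; the last two are hypothetical.
sample = [
    "1. Executive Summary",
    "1.1 Summary of Services",
    "Energy Savings (Carbon Emissions and Intensity Reduction)",
    "1.2 Summary of Broadspectrum Offer",
    "A summary of our Offer to deliver the Scope of Work as outlined in the "
    "tender documents is provided below. Please refer to the various terms "
    "and conditions of our Offer as detailed herein.",
    "Please also find the cost breakdown",
    "1.3 Next Section",
    "Text that should not be returned",
]

result = text_after_heading(sample, "Summary of Broadspectrum Offer")
for line in result:
    print(line)
```

With python-docx, the same function could be fed the real document via text_after_heading([p.text for p in document.paragraphs], "Summary of Broadspectrum Offer").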
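Additionally, here is another post that may help with a different approach, namely using the paragraph metadata to detect the heading type: extracting heading text from a Word doc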
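A rough sketch of that metadata-based idea follows. It is not taken from the linked post. It assumes the document uses Word's built-in heading styles, whose names start with "Heading" in English-language templates ("Heading 1", "Heading 2", ...); a document with custom or localized style names would need a different check. The helpers is_heading and section_text are hypothetical names, and the SimpleNamespace objects only stand in for python-docx paragraphs (which expose .text and .style.name) so the example runs without a file.

```python
from types import SimpleNamespace


def is_heading(paragraph):
    """True if the paragraph's style name starts with 'Heading' (assumed convention)."""
    style = getattr(paragraph, "style", None)
    name = getattr(style, "name", "") or ""
    return name.startswith("Heading")


def section_text(paragraphs, heading_text):
    """Return the body paragraphs under the heading containing heading_text,
    stopping at the next paragraph that has a heading style."""
    collected = []
    in_section = False
    for para in paragraphs:
        if is_heading(para):
            if in_section:
                break  # the next heading ends the section
            in_section = heading_text in para.text
            continue
        if in_section and para.text.strip():
            collected.append(para.text)
    return collected


def fake_para(text, style_name):
    # Stand-in for a python-docx Paragraph, for demonstration only
    return SimpleNamespace(text=text, style=SimpleNamespace(name=style_name))


demo = [
    fake_para("Executive Summary", "Heading 1"),
    fake_para("Summary of Services", "Heading 2"),
    fake_para("Energy Savings (Carbon Emissions and Intensity Reduction)", "Normal"),
    fake_para("Summary of Broadspectrum Offer", "Heading 2"),
    fake_para("Please refer to the various terms and conditions of our Offer as detailed herein.", "Normal"),
    fake_para("Please also find the cost breakdown", "Normal"),
    fake_para("Hypothetical Next Section", "Heading 2"),
    fake_para("Text that should not be returned", "Normal"),
]

section = section_text(demo, "Summary of Broadspectrum Offer")
for line in section:
    print(line)
```

With a real file, section_text(Document(filename1).paragraphs, "Summary of Broadspectrum Offer") should behave the same way, as long as the headings really carry heading styles. This avoids depending on the heading numbers at all.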