
How to print all lines between two phrases in a Beautiful Soup object?

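I have an HTML document converted into a soup object, and I am trying to print all the lines of text between two key phrases. I am using soup.find to search for the two phrases, but I can't work out how to print all the lines between them. Here is my code so far: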

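import re
from bs4 import BeautifulSoup as BS

# Read the converted HTML document; the with-block closes the file afterwards
with open(r'PDFs/murrumbidgee/Murrumbidgee Unregulated River Water Sources 2012_20200815.html', 'r', encoding='utf8') as file:
    contents = file.read()

soup = BS(contents, 'lxml')

# Locate the text nodes that hold the two key phrases.
# The headings in the document are capitalised, so match case-insensitively.
textStart = soup.find(text=re.compile("19  domestic and stock rights", re.I))
textEnd = soup.find(text=re.compile('20  native title rights', re.I))

print(textStart)
print(textEnd)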

An example of the HTML is here:

 <br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:22061px; width:169px; height:11px;"><span style="font-family: Arial-BoldMT; font-size:11px">19  Domestic and stock rights 
<br>unsuitable for human consumption. Water from these water sources should not be 

<br>consumed without first being tested and if necessary, appropriately treated. Such testing 
<br>and treatment is the responsibility of the water user. 
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:24038px; width:31px; height:8px;"><span style="font-family: ArialMT; font-size:8px">Page 27 
<br></span></div>

<div style="position:absolute; top:24131px;"></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:24202px; width:331px; height:19px;"><span style="font-family: ArialMT; font-size:9px"> 
<br>Water Sharing Plan for the Murrumbidgee Unregulated River Water Sources 2012  
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:24231px; width:2px; height:22px;"><span style="font-family: ArialMT; font-size:9px"> 
<br></span><span style="font-family: TimesNewRomanPSMT; font-size:11px"> 
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:24261px; width:120px; height:11px;"><span style="font-family: Arial-BoldMT; font-size:11px">20  Native title rights 

You can use the re module to extract the text. For example:

import re
from bs4 import BeautifulSoup


txt = '''
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:22061px; width:169px; height:11px;"><span style="font-family: Arial-BoldMT; font-size:11px">19  Domestic and stock rights
<br>unsuitable for human consumption. Water from these water sources should not be

<br>consumed without first being tested and if necessary, appropriately treated. Such testing
<br>and treatment is the responsibility of the water user.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:24038px; width:31px; height:8px;"><span style="font-family: ArialMT; font-size:8px">Page 27
<br></span></div>

<div style="position:absolute; top:24131px;"></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:24202px; width:331px; height:19px;"><span style="font-family: ArialMT; font-size:9px">
<br>Water Sharing Plan for the Murrumbidgee Unregulated River Water Sources 2012
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:24231px; width:2px; height:22px;"><span style="font-family: ArialMT; font-size:9px">
<br></span><span style="font-family: TimesNewRomanPSMT; font-size:11px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:24261px; width:120px; height:11px;"><span style="font-family: Arial-BoldMT; font-size:11px">20  Native title rights
'''

soup = BeautifulSoup(txt, 'html.parser')
raw_text = soup.get_text(strip=True, separator='\n')
t = re.search(r'19\s+domestic and stock rights(.*?)20\s+native title rights', raw_text, flags=re.S|re.I).group(1)
print(t)

Prints:

unsuitable for human consumption. Water from these water sources should not be
consumed without first being tested and if necessary, appropriately treated. Such testing
and treatment is the responsibility of the water user.
Page 27
Water Sharing Plan for the Murrumbidgee Unregulated River Water Sources 2012
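
If you would rather stay in the parse tree, as the question's soup.find calls do, one possible alternative is to walk the text nodes that follow the start phrase and stop at the end phrase. The following is only a minimal sketch: the helper name lines_between and the shortened HTML are made up for illustration and are not part of the original answer. It assumes a Beautiful Soup version that accepts the string= argument.

```python
import re
from bs4 import BeautifulSoup


def lines_between(soup, start_pattern, end_pattern):
    """Collect stripped, non-empty text nodes that come after the first node
    matching start_pattern and before the next node matching end_pattern."""
    start = soup.find(string=start_pattern)
    if start is None:
        return []
    lines = []
    # find_all_next(string=True) returns every later text node in document order.
    for node in start.find_all_next(string=True):
        if end_pattern.search(node):
            break
        text = node.strip()
        if text:
            lines.append(text)
    # Note: if the end phrase never appears, everything after the start is returned.
    return lines


# Shortened, hypothetical stand-in for the HTML shown in the question.
html = (
    '<div><span>19  Domestic and stock rights <br>first line <br>second line <br></span></div>'
    '<div><span>Page 27 <br></span></div>'
    '<div><span>20  Native title rights <br>ignored</span></div>'
)
soup = BeautifulSoup(html, 'html.parser')
start_re = re.compile(r'19\s+domestic and stock rights', re.I)
end_re = re.compile(r'20\s+native title rights', re.I)
result = lines_between(soup, start_re, end_re)
print('\n'.join(result))
```

Because this works on individual text nodes, each phrase has to sit inside a single node. The regex over get_text() shown above does not have that restriction.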

