How to print all lines between two phrases in a Beautiful Soup object?

I have an HTML document converted into a soup object, and I am trying to print all the lines of text between two key phrases. I am using soup.find to search for the two phrases, but I don't know how to print all the lines between them. Here is my code so far:

import re
from bs4 import BeautifulSoup as BS

# Read the converted HTML file
with open(r'PDFs/murrumbidgee/Murrumbidgee Unregulated River Water Sources 2012_20200815.html', 'r', encoding='utf8') as file:
    contents = file.read()

soup = BS(contents, 'lxml')

# Find the text nodes that contain the two key phrases
textStart = soup.find(text=re.compile("19  domestic and stock rights"))
textEnd = soup.find(text=re.compile('20  native title rights'))

print(textStart)
print(textEnd)

An example of the HTML is here:

 <br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:22061px; width:169px; height:11px;"><span style="font-family: Arial-BoldMT; font-size:11px">19  Domestic and stock rights 
<br>unsuitable for human consumption. Water from these water sources should not be 

<br>consumed without first being tested and if necessary, appropriately treated. Such testing 
<br>and treatment is the responsibility of the water user. 
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:24038px; width:31px; height:8px;"><span style="font-family: ArialMT; font-size:8px">Page 27 
<br></span></div>

<div style="position:absolute; top:24131px;"></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:24202px; width:331px; height:19px;"><span style="font-family: ArialMT; font-size:9px"> 
<br>Water Sharing Plan for the Murrumbidgee Unregulated River Water Sources 2012  
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:24231px; width:2px; height:22px;"><span style="font-family: ArialMT; font-size:9px"> 
<br></span><span style="font-family: TimesNewRomanPSMT; font-size:11px"> 
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:24261px; width:120px; height:11px;"><span style="font-family: Arial-BoldMT; font-size:11px">20  Native title rights 

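You can use the re module to extract the text: flatten the soup to plain text with get_text(), then capture everything between the two headings with a regular expression. The re.S flag lets . match newlines, and re.I makes the match case-insensitive, so "domestic" also matches "Domestic". For example: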

import re
from bs4 import BeautifulSoup


txt = '''
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:22061px; width:169px; height:11px;"><span style="font-family: Arial-BoldMT; font-size:11px">19  Domestic and stock rights
<br>unsuitable for human consumption. Water from these water sources should not be

<br>consumed without first being tested and if necessary, appropriately treated. Such testing
<br>and treatment is the responsibility of the water user.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:24038px; width:31px; height:8px;"><span style="font-family: ArialMT; font-size:8px">Page 27
<br></span></div>

<div style="position:absolute; top:24131px;"></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:24202px; width:331px; height:19px;"><span style="font-family: ArialMT; font-size:9px">
<br>Water Sharing Plan for the Murrumbidgee Unregulated River Water Sources 2012
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:24231px; width:2px; height:22px;"><span style="font-family: ArialMT; font-size:9px">
<br></span><span style="font-family: TimesNewRomanPSMT; font-size:11px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:24261px; width:120px; height:11px;"><span style="font-family: Arial-BoldMT; font-size:11px">20  Native title rights
'''

soup = BeautifulSoup(txt, 'html.parser')
raw_text = soup.get_text(strip=True, separator='\n')
t = re.search(r'19\s+domestic and stock rights(.*?)20\s+native title rights', raw_text, flags=re.S|re.I).group(1)
print(t)

Prints:

unsuitable for human consumption. Water from these water sources should not be
consumed without first being tested and if necessary, appropriately treated. Such testing
and treatment is the responsibility of the water user.
Page 27
Water Sharing Plan for the Murrumbidgee Unregulated River Water Sources 2012
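
If you would rather stay with the soup.find approach from the question, one possible alternative is to walk the document from the start node to the end node and collect the text nodes in between. The sketch below is only an illustration: the helper name text_between and the shortened sample HTML are made up for this example, and it assumes both phrases occur in the document with the start phrase first.

```python
import re
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def text_between(soup, start_pattern, end_pattern):
    """Return the stripped text nodes found between two matching text nodes.

    Hypothetical helper: returns an empty list if either phrase is missing.
    """
    start = soup.find(string=re.compile(start_pattern, re.I))
    end = soup.find(string=re.compile(end_pattern, re.I))
    if start is None or end is None:
        return []

    lines = []
    # next_elements yields every tag and string that follows `start`
    # in document order, regardless of nesting.
    for element in start.next_elements:
        if element is end:  # identity check: stop at the end node itself
            break
        if isinstance(element, NavigableString):
            text = element.strip()
            if text:
                lines.append(text)
    return lines


# Shortened stand-in for the HTML in the question (assumed structure)
html = (
    '<div><span>19  Domestic and stock rights <br>first line <br>second line <br></span></div>'
    '<div><span>Page 27 <br></span></div>'
    '<div><span>20  Native title rights <br>later text</span></div>'
)
soup = BeautifulSoup(html, "html.parser")
lines = text_between(soup, r"19\s+domestic and stock rights", r"20\s+native title rights")
print("\n".join(lines))
```

Because this works on the parsed tree rather than on flattened text, you still have access to each node's parent tag, which may help if you later want to skip page numbers or running headers based on their font size.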
