
How to print all lines between two phrases in a Beautiful Soup object?

I have an HTML document converted into a soup object, and I am trying to print all the lines of text between two key phrases. I am using soup.find to search for the two phrases, but I don't know how to print all the lines between them. Here is my code so far:

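import re
from bs4 import BeautifulSoup as BS  # assumed alias; the original snippet uses BS without showing the import

# Read the converted HTML file; the with-block closes it afterwards
with open(r'PDFs/murrumbidgee/Murrumbidgee Unregulated River Water Sources 2012_20200815.html', 'r', encoding='utf8') as file:
    contents = file.read()

soup = BS(contents, 'lxml')

# Locate the text nodes containing the start and end phrases
textStart = soup.find(text=re.compile("19  domestic and stock rights"))
textEnd = soup.find(text=re.compile('20  native title rights'))

print(textStart)
print(textEnd)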

An example of the HTML is here:

 <br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:22061px; width:169px; height:11px;"><span style="font-family: Arial-BoldMT; font-size:11px">19  Domestic and stock rights 
<br>unsuitable for human consumption. Water from these water sources should not be 

<br>consumed without first being tested and if necessary, appropriately treated. Such testing 
<br>and treatment is the responsibility of the water user. 
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:24038px; width:31px; height:8px;"><span style="font-family: ArialMT; font-size:8px">Page 27 
<br></span></div>

<div style="position:absolute; top:24131px;"></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:24202px; width:331px; height:19px;"><span style="font-family: ArialMT; font-size:9px"> 
<br>Water Sharing Plan for the Murrumbidgee Unregulated River Water Sources 2012  
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:24231px; width:2px; height:22px;"><span style="font-family: ArialMT; font-size:9px"> 
<br></span><span style="font-family: TimesNewRomanPSMT; font-size:11px"> 
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:24261px; width:120px; height:11px;"><span style="font-family: Arial-BoldMT; font-size:11px">20  Native title rights 

You can use the re module to extract the text. For example:

import re
from bs4 import BeautifulSoup


txt = '''
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:22061px; width:169px; height:11px;"><span style="font-family: Arial-BoldMT; font-size:11px">19  Domestic and stock rights
<br>unsuitable for human consumption. Water from these water sources should not be

<br>consumed without first being tested and if necessary, appropriately treated. Such testing
<br>and treatment is the responsibility of the water user.
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:24038px; width:31px; height:8px;"><span style="font-family: ArialMT; font-size:8px">Page 27
<br></span></div>

<div style="position:absolute; top:24131px;"></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:24202px; width:331px; height:19px;"><span style="font-family: ArialMT; font-size:9px">
<br>Water Sharing Plan for the Murrumbidgee Unregulated River Water Sources 2012
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:24231px; width:2px; height:22px;"><span style="font-family: ArialMT; font-size:9px">
<br></span><span style="font-family: TimesNewRomanPSMT; font-size:11px">
<br></span></div><div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:72px; top:24261px; width:120px; height:11px;"><span style="font-family: Arial-BoldMT; font-size:11px">20  Native title rights
'''

soup = BeautifulSoup(txt, 'html.parser')
raw_text = soup.get_text(strip=True, separator='\n')
t = re.search(r'19\s+domestic and stock rights(.*?)20\s+native title rights', raw_text, flags=re.S|re.I).group(1)
print(t)

Prints:

unsuitable for human consumption. Water from these water sources should not be
consumed without first being tested and if necessary, appropriately treated. Such testing
and treatment is the responsibility of the water user.
Page 27
Water Sharing Plan for the Murrumbidgee Unregulated River Water Sources 2012
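If you would rather stay with the soup.find approach from the question, one possible alternative is to walk the text nodes that follow the start phrase and stop at the node holding the end phrase. This is only a sketch under some assumptions: the HTML below is a simplified, made-up version of the excerpt, each heading sits in its own text node (as it does in the excerpt), and the names start, end and lines are illustrative.

```python
import re
from bs4 import BeautifulSoup

# Simplified, hypothetical markup modelled on the excerpt in the question
html = '''<div><span>19  Domestic and stock rights
<br>unsuitable for human consumption.
<br>and treatment is the responsibility of the water user.
<br></span></div><div><span>Page 27
<br></span></div><div><span>20  Native title rights
<br>later text</span></div>'''

soup = BeautifulSoup(html, 'html.parser')

# Find the text nodes that contain the two phrases (case-insensitive)
start = soup.find(string=re.compile(r'19\s+domestic and stock rights', re.I))
end = soup.find(string=re.compile(r'20\s+native title rights', re.I))
if start is None or end is None:
    raise ValueError('start or end phrase not found')

lines = []
# Visit every text node after the start node, in document order
for node in start.find_all_next(string=True):
    if node is end:  # stop once the end heading is reached
        break
    text = node.strip()
    if text:  # skip whitespace-only nodes
        lines.append(text)

print('\n'.join(lines))
```

Like the regex version, this also picks up page furniture such as "Page 27" that sits between the two headings, so some filtering may still be needed afterwards. It would not help if a heading and its body text shared a single text node; in that case the regex over get_text() is the simpler route.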
