How can I print text line by line using BeautifulSoup?
This is part of a sample test.html file:
<html>
<body>
<div>
...
...
<table class="width-max">
<tr>
<td style="max-width: 300px; min-width:300px;">
<a href="nowhere.com">
<h2>
<b>
<font size="3">
My College
</font>
</b>
</h2>
</a>
<h4>
<font size="2">
My Name
</font>
<br/>
</h4>
My Address
<br/>
My City, XY 19604
<br/>
My Country
<br/>
<br/>
Email:
<a href="javascript:NewWindow=window.open('nowhere.com;email=example@nowhere.edu','NewWindow','width=600,height=600,menubar=0');NewWindow.focus()">
example@nowhere.edu
</a>
<br/>
Website:
<a href="http://www.nowhere.edu" target="newwindow">
http://www.nowhere.edu
</a>
<br/>
<br/>
<br/>
</td>
...
...
</table>
<hr/>
<table class="width-max">
<tr>
<td style="max-width: 300px; min-width:300px;">
<a href="nowhere.com">
<h2>
<b>
<font size="3">
His College
</font>
</b>
</h2>
</a>
<h4>
<font size="2">
His name
</font>
<br/>
</h4>
His Address
<br/>
His City, YX 49506
<br/>
His Country
<br/>
<br/>
Phone: XX-YY-ZZ
<br/>
Email:
<a href="javascript:NewWindow=window.open('nowhere.com;email=example@nowhere2.edu','NewWindow','width=600,height=600,menubar=0');NewWindow.focus()">
example@nowhere2.edu
</a>
<br/>
Website:
<a href="http://nowhere2.edu/" target="newwindow">
http://nowhere2.edu
</a>
<br/>
<br/>
...
...
</table>
...
...
</div>
</body>
</html>
The output I want:
My College
My Name
My Address
My City, XY 19604
My Country
Email:
example@nowhere.edu
Website:
http://www.nowhere.edu
His College
His Name
His Address
His City, YX 49506
His Country
Phone: XX-YY-ZZ
Email:
example@nowhere2.edu
Website:
http://www.nowhere2.edu
At first I tried:
from bs4 import BeautifulSoup

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    tables = soup.find_all('table', class_='width-max')
    for table in tables:
        print(table.get_text())
It prints the text on new lines but produces a bunch of blank lines and white space:
My College
My Name
...
Then I tried:
from bs4 import BeautifulSoup

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    tables = soup.find_all('table', class_='width-max')
    for table in tables:
        texts = ' '.join(table.text.split())
        print(texts)
It removes the blank lines and white space but combines all the text into a single line:
My College My Name My Address ... ... http://www.nowhere2.edu
Finally, I tried the strip() and stripped_strings() methods, and I also tried replacing <br> with \n using the replace_with() method. But I have not yet managed to print the exact output.
Try joining with a newline instead of a space:
from bs4 import BeautifulSoup

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    tables = soup.find_all('table', class_='width-max')
    for table in tables:
        texts = '\n'.join(table.text.split())
        print(texts)
Edit: The previous snippet would break your multi-word lines into single-word lines. Try this instead:
import os
from bs4 import BeautifulSoup

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    tables = soup.find_all('table', class_='width-max')
    for table in tables:
        # Skip tables whose text is whitespace only
        if not table.get_text().isspace():
            # Keep only the non-blank lines, each trimmed of surrounding whitespace
            lines = [l.strip() for l in table.get_text().splitlines() if l.strip()]
            print(os.linesep.join(lines))
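To illustrate why the edit was needed, here is a minimal sketch comparing the two approaches on a plain string. The `raw` value is a hypothetical stand-in for what `table.get_text()` might return; it is not taken from the real file.

```python
# Hypothetical stand-in for table.get_text(): blank lines plus stray spaces
raw = "\n\n   My College\n\n My Name \n\nMy City, XY 19604\n\n\n"

# split() breaks on every run of whitespace, so multi-word lines fall apart
by_words = "\n".join(raw.split())

# splitlines() keeps each line whole; strip() trims it and blank lines are dropped
by_lines = "\n".join(line.strip() for line in raw.splitlines() if line.strip())

print(by_words)
print("---")
print(by_lines)
```

The first join puts every word on its own line, while the second keeps "My City, XY 19604" together on one line.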
Just change your print statement to add a newline, like this:
print('\n' + texts)
You need to clean the table.get_text() values in order to print each line one after another.
With two regular expressions you can do that like this:
from bs4 import BeautifulSoup
import re

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    tables = soup.find_all('table', class_='width-max')
    for table in tables:
        # Drop the "..." placeholders, strip runs of 3+ spaces, then collapse repeated newlines
        print(re.sub(r"(\n)+", r"\n", re.sub(r" {3,}", "", table.get_text().replace('...', ''))), end="")
This will output:
My College
My Name
My Address
My City, XY 19604
My Country
Email:
example@nowhere.edu
Website:
http://www.nowhere.edu
His College
His name
His Address
His City, YX 49506
His Country
Phone: XX-YY-ZZ
Email:
example@nowhere2.edu
Website:
http://nowhere2.edu
The first regex, " {3,}" (a space followed by {3,}), removes every run of three or more consecutive spaces, which is the indentation in front of each line. The second, "(\n)+" with the replacement "\n", collapses any run of consecutive newlines into a single \n, so that the print function outputs the data line by line.
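As a rough illustration of those two substitutions, the sketch below applies them one at a time to a small string. The `raw` value is a hypothetical stand-in for `table.get_text()` and the intermediate variable names are made up for this example.

```python
import re

# Hypothetical stand-in for table.get_text(): indented lines with blank lines between
raw = "\n      My College\n\n\n      My Name\n      My Country\n\n"

# Step 1: remove runs of three or more spaces (the indentation)
no_indent = re.sub(r" {3,}", "", raw)

# Step 2: collapse consecutive newlines into a single newline
collapsed = re.sub(r"(\n)+", r"\n", no_indent)

print(repr(no_indent))
print(repr(collapsed))
```

Note that a single leading newline survives both steps, so the printed text may start with one empty line unless it is also stripped.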
In addition, to match your expected output, get_text().replace('...', '') was added to remove the ... from the text.
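As an alternative to the regex clean-up (this is not part of the answers above), BeautifulSoup's `stripped_strings` generator can produce one trimmed string per text node. The sketch below is a minimal example under some assumptions: it uses a shortened, hypothetical inline sample instead of test.html, and the built-in `html.parser` rather than `lxml` so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical inline sample standing in for test.html
html = """
<table class="width-max"><tr><td>
  <a href="nowhere.com"><h2><b><font size="3"> My College </font></b></h2></a>
  <h4><font size="2"> My Name </font><br/></h4>
  My Address <br/> My City, XY 19604 <br/>
  Email: <a href="#"> example@nowhere.edu </a><br/>
</td></tr></table>
"""

# html.parser ships with Python; 'lxml' should also work if it is installed
soup = BeautifulSoup(html, "html.parser")

for table in soup.find_all("table", class_="width-max"):
    # stripped_strings yields each text node with surrounding whitespace removed
    # and skips nodes that contain only whitespace
    lines = list(table.stripped_strings)
    print("\n".join(lines))
```

Note that `stripped_strings` is accessed as an attribute, without parentheses. Any literal "..." text inside a table would still come through as its own line and would need to be filtered out separately.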