How to print text line by line using BeautifulSoup?

Question

This is part of a sample test.html file:

<html>
<body>
<div>
...
...
<table class="width-max">
            <tr>
             <td style="max-width: 300px; min-width:300px;">
              <a href="nowhere.com">
               <h2>
                <b>
                 <font size="3">
                  My College
                 </font>
                </b>
               </h2>
              </a>
              <h4>
               <font size="2">
                My Name
               </font>
               <br/>
              </h4>
              My Address
              <br/>
              My City, XY 19604
              <br/>
              My Country
              <br/>
              <br/>
              Email:
              <a href="javascript:NewWindow=window.open('nowhere.com;email=example@nowhere.edu','NewWindow','width=600,height=600,menubar=0');NewWindow.focus()">
               example@nowhere.edu
              </a>
              <br/>
              Website:
              <a href="http://www.nowhere.edu" target="newwindow">
               http://www.nowhere.edu
              </a>
              <br/>
              <br/>
              <br/>
             </td>
              ...
              ...
</table>
<hr/>
<table class="width-max">
            <tr>
             <td style="max-width: 300px; min-width:300px;">
              <a href="nowhere.com">
               <h2>
                <b>
                 <font size="3">
                  His College
                 </font>
                </b>
               </h2>
              </a>
              <h4>
               <font size="2">
                His name
               </font>
               <br/>
              </h4>
              His Address
              <br/>
              His City, YX 49506
              <br/>
              His Country
              <br/>
              <br/>
              Phone: XX-YY-ZZ
              <br/>
              Email:
              <a href="javascript:NewWindow=window.open('nowhere.com;email=example@nowhere2.edu','NewWindow','width=600,height=600,menubar=0');NewWindow.focus()">
               example@nowhere2.edu
              </a>
              <br/>
              Website:
              <a href="http://nowhere2.edu/" target="newwindow">
               http://nowhere2.edu
              </a>
              <br/>
              <br/>
              ...
              ...
</table>
...
...
</div>
</body>
</html>

The output I want:

My College
My Name
My Address
My City, XY 19604
My Country
Email:
example@nowhere.edu
Website:
http://www.nowhere.edu

His College
His Name
His Address
His City, YX 49506
His Country
Phone: XX-YY-ZZ
Email:
example@nowhere2.edu
Website:
http://www.nowhere2.edu

At first I tried:

from bs4 import BeautifulSoup

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

    tables = soup.find_all('table', class_='width-max')

    for table in tables:
        print(table.get_text())

It prints each piece of text on a new line but produces a bunch of blank lines and extra whitespace:



         My College

      My Name
...

Then I tried:

from bs4 import BeautifulSoup

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    tables = soup.find_all('table', class_='width-max')

    for table in tables:
        texts = ' '.join(table.text.split())
        print(texts)

It removes the blank lines and extra whitespace but combines all the text into a single line:

My College My Name My Address ... ... http://www.nowhere2.edu

Finally, I tried using strip() and stripped_strings, and I also tried to replace <br> with \n using the replace_with() method, but I have not yet managed to print the exact output.

Answer 1

Try joining with a newline instead of a space:

from bs4 import BeautifulSoup
with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    tables = soup.find_all('table', class_='width-max')
    for table in tables:
        texts = '\n'.join(table.text.split())
        print(texts)

Edit: The previous snippet breaks multi-word lines into one word per line. Try this instead:

from bs4 import BeautifulSoup

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    tables = soup.find_all('table', class_='width-max')
    for table in tables:
        # Skip tables whose text consists of whitespace only
        if not table.get_text().isspace():
            # Strip each line and drop the lines that end up empty
            lines = [l.strip() for l in table.get_text().splitlines() if l.strip()]
            print('\n'.join(lines))
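
As an alternative to splitting the text by hand, here is a minimal sketch that uses the stripped_strings generator mentioned in the question. The SAMPLE_HTML string and the table_lines helper are hypothetical stand-ins (a shortened inline copy of test.html), and 'html.parser' is used only so the sketch does not depend on lxml. It assumes each text node holds a single line, as in the sample.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the test.html file from the question
SAMPLE_HTML = """
<table class="width-max">
 <tr>
  <td>
   <a href="nowhere.com"><h2><b><font size="3">
     My College
   </font></b></h2></a>
   <h4><font size="2">
     My Name
   </font><br/></h4>
   My Address
   <br/>
   My City, XY 19604
   <br/>
   Email:
   <a href="#">
    example@nowhere.edu
   </a>
  </td>
 </tr>
</table>
"""


def table_lines(table):
    """Return the table's text nodes, stripped, skipping whitespace-only ones."""
    return list(table.stripped_strings)


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
tables = soup.find_all('table', class_='width-max')

# One text node per line, with a blank line between tables
print('\n\n'.join('\n'.join(table_lines(t)) for t in tables))
```

stripped_strings strips each text node as a whole, so a text node that itself contains several lines would still come out as one item with embedded newlines. table.get_text('\n', strip=True) should produce the same joined string for a single table.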

Answer 2

Just change your print statement and add a newline, like this:

print('\n' + texts)
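
To check what this change does, here is a small sketch on a plain string. The raw value is a made-up stand-in for table.get_text(), and texts is built the same way as in the question's second attempt. The leading '\n' only adds one empty line before the output; the joined text itself still sits on a single line.

```python
# Made-up stand-in for the whitespace-heavy result of table.get_text()
raw = "\n\n         My College\n\n      My Name\n      My Address\n"

# Same normalisation as in the question's second attempt
texts = ' '.join(raw.split())

# The change suggested in this answer
prefixed = '\n' + texts

print(repr(prefixed))
```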

Answer 3

You need to clean the value returned by table.get_text() in order to print the lines one after another.
You can do that with two regular expressions:

from bs4 import BeautifulSoup
import re

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    tables = soup.find_all('table', class_='width-max')

    for table in tables:
        print(re.sub(r"(\n)+", r"\n", re.sub(r" {3,}", "", table.get_text().replace('...', ''))) , end="")

This will output:

My College
My Name
My Address
My City, XY 19604
My Country
Email:
example@nowhere.edu
Website:
http://www.nowhere.edu    

His College
His name
His Address
His City, YX 49506
His Country
Phone: XX-YY-ZZ
Email:
example@nowhere2.edu
Website:
http://nowhere2.edu

The first regex, r" {3,}", removes every run of three or more consecutive spaces (the indentation in front of each line), not empty lines. The second, r"(\n)+" with the replacement r"\n", collapses each run of consecutive newlines into a single \n, so that print outputs the data line by line.
In addition, get_text().replace('...', '') removes the literal ... placeholders from the text so that the result matches your expected output.
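
The two substitutions can be tried in isolation on a plain string. This is only a sketch: the raw value is an invented stand-in for table.get_text(), and clean_text is a hypothetical helper name wrapping the same two patterns used above.

```python
import re

# Invented stand-in for table.get_text(): indented lines separated by blank lines
raw = (
    "\n\n         My College\n\n"
    "      My Name\n"
    "   \n"
    "     My City, XY 19604\n"
)


def clean_text(text):
    """Apply the two substitutions from this answer, one step at a time."""
    # Step 1: delete every run of three or more consecutive spaces
    without_indent = re.sub(r" {3,}", "", text)
    # Step 2: collapse each run of newlines into a single newline
    return re.sub(r"(\n)+", r"\n", without_indent)


cleaned = clean_text(raw)
print(cleaned, end="")
```

Single spaces inside a line, such as those in "My City, XY 19604", are left alone because the first pattern needs at least three spaces in a row. For the same reason, indentation made of tabs or of fewer than three spaces would survive this cleanup.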

3 answers

Answer 1
0 2020-06-09 07:30:25

Answer 2
0 2020-06-09 07:31:45

Answer 3
0 accepted 2020-06-09 07:47:58