How can I print text line by line using BeautifulSoup?

This is part of a sample test.html file:

<html>
<body>
<div>
...
...
<table class="width-max">
            <tr>
             <td style="max-width: 300px; min-width:300px;">
              <a href="nowhere.com">
               <h2>
                <b>
                 <font size="3">
                  My College
                 </font>
                </b>
               </h2>
              </a>
              <h4>
               <font size="2">
                My Name
               </font>
               <br/>
              </h4>
              My Address
              <br/>
              My City, XY 19604
              <br/>
              My Country
              <br/>
              <br/>
              Email:
              <a href="javascript:NewWindow=window.open('nowhere.com;email=example@nowhere.edu','NewWindow','width=600,height=600,menubar=0');NewWindow.focus()">
               example@nowhere.edu
              </a>
              <br/>
              Website:
              <a href="http://www.nowhere.edu" target="newwindow">
               http://www.nowhere.edu
              </a>
              <br/>
              <br/>
              <br/>
             </td>
              ...
              ...
</table>
<hr/>
<table class="width-max">
            <tr>
             <td style="max-width: 300px; min-width:300px;">
              <a href="nowhere.com">
               <h2>
                <b>
                 <font size="3">
                  His College
                 </font>
                </b>
               </h2>
              </a>
              <h4>
               <font size="2">
                His name
               </font>
               <br/>
              </h4>
              His Address
              <br/>
              His City, YX 49506
              <br/>
              His Country
              <br/>
              <br/>
              Phone: XX-YY-ZZ
              <br/>
              Email:
              <a href="javascript:NewWindow=window.open('nowhere.com;email=example@nowhere2.edu','NewWindow','width=600,height=600,menubar=0');NewWindow.focus()">
               example@nowhere2.edu
              </a>
              <br/>
              Website:
              <a href="http://nowhere2.edu/" target="newwindow">
               http://nowhere2.edu
              </a>
              <br/>
              <br/>
              ...
              ...
</table>
...
...
</div>
</body>
</html>

The output I want:

My College
My Name
My Address
My City, XY 19604
My Country
Email:
example@nowhere.edu
Website:
http://www.nowhere.edu

His College
His Name
His Address
His City, YX 49506
His Country
Phone: XX-YY-ZZ
Email:
example@nowhere2.edu
Website:
http://www.nowhere2.edu

At first I tried:

from bs4 import BeautifulSoup

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

    tables = soup.find_all('table', class_='width-max')

    for table in tables:
        print(table.get_text())

It prints each text on a new line but produces a bunch of blank lines and whitespace:



         My College

      My Name
...

Then I tried:

from bs4 import BeautifulSoup

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    tables = soup.find_all('table', class_='width-max')

    for table in tables:
        texts = ' '.join(table.text.split())
        print(texts)

It removes the blank lines and whitespace but combines all the text into a single line:

My College My Name My Address ... ... http://www.nowhere2.edu

Finally, I tried the strip() method and the stripped_strings generator, and I also tried replacing <br> with \n using the replace_with() method. But I have not yet managed to print the exact output.

Try joining with a newline instead of a space:

from bs4 import BeautifulSoup
with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    tables = soup.find_all('table', class_='width-max')
    for table in tables:
        texts = '\n'.join(table.text.split())
        print(texts)

Edit: The previous snippet would break your multi-word lines into single-word lines. Try this instead:

from bs4 import BeautifulSoup

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    tables = soup.find_all('table', class_='width-max')
    for table in tables:
        # Skip tables that contain nothing but whitespace
        if not table.get_text().isspace():
            # Strip each line and drop the ones that are empty or whitespace-only
            text = '\n'.join(l.strip() for l in table.get_text().splitlines() if l.strip())
            print(text)
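
As a possible alternative, the stripped_strings generator mentioned in the question can do the stripping and the blank-line filtering in one step. The following is only a minimal sketch: SAMPLE_HTML is a hypothetical, shortened stand-in for test.html, table_lines is a helper name made up for illustration, and 'html.parser' is used instead of 'lxml' only so the snippet has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for test.html
SAMPLE_HTML = """
<table class="width-max">
  <tr>
    <td>
      <a href="nowhere.com"><h2><b><font size="3">
        My College
      </font></b></h2></a>
      <h4><font size="2"> My Name </font><br/></h4>
      My Address
      <br/>
      My City, XY 19604
      <br/>
      Email:
      <a href="#"> example@nowhere.edu </a>
    </td>
  </tr>
</table>
"""


def table_lines(table):
    """Return the table's text nodes, stripped, skipping whitespace-only ones."""
    return list(table.stripped_strings)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
tables = soup.find_all("table", class_="width-max")

# One block of lines per table, with a blank line between tables
blocks = ["\n".join(table_lines(t)) for t in tables]
print("\n\n".join(blocks))
```

Note that stripped_strings yields one string per text node, so this matches the desired output only when each text node holds a single line, as it does between the <br/> tags here. The literal ... placeholders in the real file would also show up as lines and would need to be filtered out separately.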

Just change your print statement and add a newline there, like this:

print('\n' + texts)

You need to clean the table.get_text() values in order to print the lines one after another.
With two regular expressions you can do that like this:

from bs4 import BeautifulSoup
import re

with open('test.html', 'r') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
    tables = soup.find_all('table', class_='width-max')

    for table in tables:
        print(re.sub(r"(\n)+", r"\n", re.sub(r" {3,}", "", table.get_text().replace('...', ''))) , end="")

This will output:

My College
My Name
My Address
My City, XY 19604
My Country
Email:
example@nowhere.edu
Website:
http://www.nowhere.edu    

His College
His name
His Address
His City, YX 49506
His Country
Phone: XX-YY-ZZ
Email:
example@nowhere2.edu
Website:
http://nowhere2.edu

The first regex, r" {3,}", removes every run of three or more consecutive spaces, which strips the indentation while leaving the single spaces inside a line alone. The second, r"(\n)+" with the replacement r"\n", replaces each run of consecutive newlines with a single \n, so that print outputs the data line by line.
In addition, to match your expected output, get_text().replace('...', '') is added to remove the ... placeholders from the text.
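
To see what each substitution does in isolation, here is a small sketch that runs the same two patterns on a plain string. The raw value is a hypothetical stand-in for what table.get_text() might return, and clean is a helper name introduced only for this illustration.

```python
import re

# Hypothetical stand-in for the text returned by table.get_text()
raw = (
    "\n\n\n"
    "                  My College\n"
    "\n\n"
    "                My Name\n"
    "\n\n"
    "              My Address\n"
    "              \n"
    "              My City, XY 19604\n"
    "              \n"
)


def clean(text):
    # Step 1: drop runs of three or more spaces (the indentation);
    # single spaces inside a line are left untouched.
    no_indent = re.sub(r" {3,}", "", text)
    # Step 2: collapse each run of consecutive newlines into one.
    return re.sub(r"\n+", "\n", no_indent)


cleaned = clean(raw)
print(cleaned, end="")
```

The result still starts with a single newline, which is what leaves a blank line between consecutive tables when printing with end="". Call cleaned.strip() if that leading newline is not wanted.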
