Python: Getting paragraphs from HTML

Question

I am iterating through a list of links to get all of Obama's speeches. However, for some links, the HTML looks like this:

<p><font face="Verdana, Arial, Helvetica, sans-serif" size="3">If 
              there is anyone out there who still doubts that America is a place 
              where all things are possible; who still wonders if the dream of 
              our founders is alive in our time; who still questions the power 
              of our democracy, tonight is your answer.</font></p>
<p><font face="Verdana, Arial, Helvetica, sans-serif" size="3">It’s 
              the answer told by lines that stretched around schools and churches 
              in numbers this nation has never seen; by people who waited three 
              hours and four hours, many for the very first time in their lives, 
              because they believed that this time must be different; that their 
              voice could be that difference.</font></p>
<p><font face="Verdana, Arial, Helvetica, sans-serif" size="3">It’s 
              the answer spoken by young and old, rich and poor, Democrat and 
              Republican, black, white, Latino, Asian, Native American, gay, straight, 
              disabled and not disabled – Americans who sent a message to 
              the world that we have never been a collection of Red States and 
              Blue States: we are, and always will be, the United States of America.</font></p>

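If I call soup.find_all('font'), each match gives me only one of the paragraphs rather than the whole text. For other links, however, the HTML may look like the text below, and there soup.find_all('font') returns the entire text to me.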

</font></strong><font face="Verdana, Arial, Helvetica, sans-serif" size="3"><br/>
</font></font><font face="Verdana, Arial, Helvetica, sans-serif" size="3"><br/>
            My fellow citizens:<br/>
<br/>
            I stand here today humbled by the task before us, grateful for the 
            trust you have bestowed, mindful of the sacrifices borne by our ancestors. 
            I thank President Bush for his service to our nation, as well as the 
            generosity and cooperation he has shown throughout this transition.<br/>
<br/>
            Forty-four Americans have now taken the presidential oath. The words 
            have been spoken during rising tides of prosperity and the still waters 
            of peace. Yet, every so often the oath is taken amidst gathering clouds 
            and raging storms. At these moments, America has carried on not simply 
            because of the skill or vision of those in high office, but because 
            We the People have remained faithful to the ideals of our forbearers, 
            and true to our founding documents.<br/>
<br/>
            So it has been. So it must be with this generation of Americans.<br/>
</font> <div align="left">

Goal: I want to get the entire speech, not just a single paragraph. How can I do this in Python with BeautifulSoup?

The two speeches come from:

http://obamaspeeches.com/E11-Barack-Obama-Election-Night-Victory-Speech-Grant-Park-Illinois-November-4-2008.htm

http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm

Answer 1

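Unfortunately, the pages do not follow a consistent structure, which means more work for you: a single logic flow will not handle all of them.

However, for the specific cases you listed, you can do either of the following:

Select the parent that contains the font tags, i.e. the table. (Note: because the site uses a table-based layout, you will need some logic to verify which table holds the content you want.)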

for table in soup.find_all('table'):
    if this_is_the_table_you_want:
        print(table.text)
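
One possible way to fill in the `this_is_the_table_you_want` check is sketched below. The heuristic (take the table with the most text among tables that contain no nested table) is an assumption, not something the site guarantees. The sample HTML and the `find_speech_table` helper are hypothetical and only illustrate the idea. The real pages may need a different rule.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a layout table wrapping a navigation table and a content table.
html = """
<table><tr><td>
  <table id="nav"><tr><td><font>Home | Speeches</font></td></tr></table>
  <table id="body"><tr><td>
    <p><font>If there is anyone out there who still doubts.</font></p>
    <p><font>It's the answer told by lines that stretched around schools.</font></p>
  </td></tr></table>
</td></tr></table>
"""


def find_speech_table(soup):
    """Assumed heuristic: among tables with no nested table,
    return the one holding the most text (None if there are no tables)."""
    leaf_tables = [t for t in soup.find_all("table") if t.find("table") is None]
    return max(leaf_tables, key=lambda t: len(t.get_text()), default=None)


soup = BeautifulSoup(html, "html.parser")
table = find_speech_table(soup)

# Collapse the indentation and line breaks left over from the HTML source.
speech = " ".join(table.get_text().split())
print(speech)
```

Restricting the search to tables without nested tables avoids picking the outer layout table, whose text would include the navigation as well.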

-or-

Simply build a string from the tags you already have:

speech_text = ""
for font in soup.find_all('font'):
    speech_text += font.text
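
A slightly more defensive variant of the loop above is sketched here, under two assumptions: some pages nest one font tag inside another (as the second snippet in the question suggests), and the indentation inside the tags is unwanted. The sample HTML is made up for illustration, and real pages with malformed markup may parse differently.

```python
from bs4 import BeautifulSoup

# Hypothetical sample combining both layouts from the question.
html = """
<p><font size="3">If
      there is anyone out there.</font></p>
<font size="3"><font size="3">My fellow citizens:<br/>
      I stand here today.</font></font>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only outermost <font> tags, so text inside a nested <font> is not counted twice.
outer_fonts = [f for f in soup.find_all("font") if f.find_parent("font") is None]

# get_text(" ") puts a space between text fragments; split()/join() then
# collapses the indentation and line breaks left over from the HTML source.
paragraphs = [" ".join(f.get_text(" ").split()) for f in outer_fonts]
paragraphs = [p for p in paragraphs if p]  # drop font tags that held only <br/> or whitespace

speech_text = "\n\n".join(paragraphs)
print(speech_text)
```

Joining once at the end also avoids the missing separator between paragraphs that plain `+=` concatenation produces.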

Solution 1
1 accepted 2014-06-30 12:10:18
