
BeautifulSoup HTML extract text

I am working with BeautifulSoup for the first time and am trying to extract a joke from a downloaded HTML file. Unfortunately, there are no classes I can use to extract the information.

The file contains the comments "begin of joke" and "end of joke", and what I want is the title as well as the text of the joke. My code and its output are below.

from bs4 import BeautifulSoup

with open('init1.html', 'r') as f:
    contents = f.read()
    soup = BeautifulSoup(contents, 'lxml')   
    print(soup.prettify)

Output:
<bound method Tag.prettify of <html>
<head>
<title>Joke 1 of 25</title>
</head>
<body bgcolor="#fddf84" text="black">
<center>
<table cellpadding="0" cellspacing="0" width="620">
<td width="470">
<font size="+1"> <br/>
<!--begin of joke -->
A man visits the doctor. The doctor says "I have bad news for you.You have
cancer and Alzheimer's disease". <p>
The man replies "Well,thank God I don't have cancer!"
<!--end of joke -->
</p></font></td></table>
</center>
</body>
</html>
>
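A side note on the output above: the `<bound method Tag.prettify of ...>` wrapper appears because `soup.prettify` is referenced without being called. The sketch below illustrates the difference on a shortened stand-in document (the inline HTML string and the `html.parser` choice are assumptions made so the example runs without the original file or `lxml`).

```python
from bs4 import BeautifulSoup

# Shortened stand-in for init1.html (assumption, not the original file).
html = "<html><head><title>Joke 1 of 25</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Without parentheses this is the method object itself, so printing it
# shows "<bound method ...>" wrapped around the document's repr.
as_attribute = repr(soup.prettify)

# With parentheses the method runs and returns the indented HTML string.
as_call = soup.prettify()

print(as_attribute)
print(as_call)
```

Calling `print(soup.prettify())` in the original script should therefore print only the formatted markup.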

This is simple and it worked:

soup.table.td.text.strip()
# -> 'A man visits the doctor. The doctor says "I have bad news for you.You have\ncancer and Alzheimer\'s disease". \nThe man replies "Well,thank God I don\'t have cancer!"'
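The one-liner above returns the joke text but not the title, and it relies on the joke being the only content of the first table cell. As a possible alternative, here is a minimal sketch that reads the title from `<title>` and collects only the text between the two HTML comments. The helper name `extract_joke`, the inline HTML string, and the use of `html.parser` instead of `lxml` are assumptions for illustration, not part of the original post.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Inline copy of the markup shown in the question (stand-in for init1.html).
html = """<html>
<head>
<title>Joke 1 of 25</title>
</head>
<body bgcolor="#fddf84" text="black">
<center>
<table cellpadding="0" cellspacing="0" width="620">
<td width="470">
<font size="+1"> <br/>
<!--begin of joke -->
A man visits the doctor. The doctor says "I have bad news for you.You have
cancer and Alzheimer's disease". <p>
The man replies "Well,thank God I don't have cancer!"
<!--end of joke -->
</p></font></td></table>
</center>
</body>
</html>"""


def extract_joke(soup):
    """Hypothetical helper: return (title, joke_text) using the comment markers."""
    title = soup.title.get_text(strip=True)

    # Locate the comment that marks the start of the joke.
    start = soup.find(
        string=lambda s: isinstance(s, Comment) and "begin of joke" in s
    )

    parts = []
    # Walk the document in order until the closing comment is reached.
    for node in start.next_elements:
        if isinstance(node, Comment):
            if "end of joke" in node:
                break
            continue  # skip any other comment
        if isinstance(node, NavigableString):
            parts.append(str(node))

    # Collapse newlines and repeated spaces into single spaces.
    text = " ".join("".join(parts).split())
    return title, text


soup = BeautifulSoup(html, "html.parser")
title, text = extract_joke(soup)
print(title)
print(text)
```

Anchoring on the comments rather than on `table.td` should keep working if the cell later gains extra markup around the joke, at the cost of a few more lines.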
