
Extracting Text from HTML Using BeautifulSoup

Hi, I am trying to extract text from an HTML page using BeautifulSoup in Python. The code runs, but I am not getting what I need. My code is as follows:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
raw = BeautifulSoup(html).get_text()

The Python console reports the following. I do not understand the problem and would appreciate any help.

raw = BeautifulSoup(html).get_text()
C:/Users/muradz14/.spyder-py3/raw.py:1: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 1 of the file C:/Users/muradz14/.spyder-py3/raw.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.

That is just a warning, not an error. It is fairly self-explanatory: there is a small chance that the code will behave differently with a different parser, so the warning suggests that you state explicitly which parser you use. You can do as it suggests like this: raw = BeautifulSoup(html, features="lxml").get_text()
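
As a minimal sketch of that suggestion, the snippet below parses a small inline document instead of the BBC page so that it runs without network access. SAMPLE_HTML and the helper extract_text are hypothetical names introduced only for illustration, and "html.parser" is used here only because it needs no extra install; substitute "lxml" if that is what your system has.

```python
import warnings

from bs4 import BeautifulSoup

# Hypothetical inline document standing in for the downloaded page.
SAMPLE_HTML = "<html><body><h1>Title</h1><p>Hello, <b>world</b>!</p></body></html>"


def extract_text(html, parser="html.parser"):
    """Return the text of `html`, parsed with an explicitly named parser."""
    soup = BeautifulSoup(html, features=parser)
    # separator/strip only tidy the whitespace between text fragments.
    return soup.get_text(separator=" ", strip=True)


# Record any warnings raised while parsing so we can inspect them afterwards.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    text = extract_text(SAMPLE_HTML)

# Keep only warnings that complain about a missing parser argument.
parser_warnings = [w for w in caught if "No parser was explicitly specified" in str(w.message)]

print(text)
print("parser warnings:", len(parser_warnings))
```

Because the parser is named in the constructor call, the "No parser was explicitly specified" warning should no longer appear.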

Note that the available parsers differ between systems. For me, it's features="html.parser" (that parser ships with Python's standard library, whereas lxml has to be installed separately).
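
If the same script has to run on machines with different parsers installed, one possible approach is to try a preferred parser and fall back to another. This is only a sketch: make_soup and its preferred argument are hypothetical names, and it assumes bs4 raises FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, preferred=("lxml", "html.parser")):
    """Parse `html` with the first parser in `preferred` that is installed.

    Returns a (soup, parser_name) pair so the caller can see which parser was used.
    """
    for name in preferred:
        try:
            return BeautifulSoup(html, features=name), name
        except FeatureNotFound:
            # This parser is not available on this system; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup, used = make_soup("<p>Hi</p>")
print(used, soup.get_text())
```

Keep in mind that different parsers can build slightly different trees from malformed HTML, so pinning a single parser is the more predictable choice when you control the environment.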
