Beautiful Soup Cleaning and Errors
I have this code:
from bs4 import BeautifulSoup
import urllib2
from lxml import html
from lxml.etree import tostring
trees = urllib2.urlopen('http://aviationweather.gov/adds/metars/index?station_ids=KJFK&std_trans=translated&chk_metars=on&hoursStr=most+recent+only&chk_tafs=on&submit=Submit').read()
soup = BeautifulSoup(open(trees))
print soup.get_text()
item=soup.findAll(id="info")
print item
However, when I type soup in my window it gives me an error, and when my program runs it prints very long HTML code, and so on. Any help would be greatly appreciated.
The first problem is in this part:
trees = urllib2.urlopen('http://aviationweather.gov/adds/metars/index?station_ids=KJFK&std_trans=translated&chk_metars=on&hoursStr=most+recent+only&chk_tafs=on&submit=Submit').read()
soup = BeautifulSoup(open(trees))
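trees already holds the page's HTML as a string, because .read() was called on the response. There is no need to call open() on it (open() expects a file path, not page content). Pass it to BeautifulSoup directly: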
soup = BeautifulSoup(trees, "html.parser")
We are also explicitly setting html.parser as the underlying parser.
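To see why open() is unnecessary, here is a minimal offline sketch. It is written in Python 3 syntax (the code in this thread is Python 2), and the markup string is made up for the example rather than taken from the real page. BeautifulSoup accepts either a string of markup or an object with a .read() method:

```python
import io
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page.
markup = "<html><body><p id='info'>hello</p></body></html>"

# Passing the markup string directly works; no open() involved.
soup_from_string = BeautifulSoup(markup, "html.parser")

# A file-like object (anything with a .read() method) works as well.
soup_from_file = BeautifulSoup(io.StringIO(markup), "html.parser")

text_from_string = soup_from_string.find(id="info").get_text()
text_from_file = soup_from_file.find(id="info").get_text()
print(text_from_string, text_from_file)
```

Either form gives the same parsed tree, so the choice only depends on whether you already called .read() on the response.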
Then, you need to be specific about what you want to extract from the page. Here is example code to get the METAR text value:
from bs4 import BeautifulSoup
import urllib2
trees = urllib2.urlopen('http://aviationweather.gov/adds/metars/index?station_ids=KJFK&std_trans=translated&chk_metars=on&hoursStr=most+recent+only&chk_tafs=on&submit=Submit').read()
soup = BeautifulSoup(trees, "html.parser")
item = soup.find("strong", text="METAR text:").find_next("strong").get_text(strip=True).replace("\n", "")
print item
This prints KJFK 220151Z 20016KT 10SM BKN250 24/21 A3007 RMK AO2 SLP183 T02440206.
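The find(...).find_next(...) chain can also be tried without a network connection. The sketch below uses Python 3 syntax and a small hand-written fragment; the table layout is an assumption for illustration and may not match the real page. It uses string=, which is the newer name for the text= argument in Beautiful Soup 4.4.0 and later, and it checks for None so a missing label does not raise an AttributeError:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the real page layout may differ.
sample = """
<table>
  <tr>
    <td><strong>METAR text:</strong></td>
    <td><strong>
      KJFK 220151Z 20016KT 10SM BKN250 24/21 A3007
    </strong></td>
  </tr>
</table>
"""

soup = BeautifulSoup(sample, "html.parser")

# Locate the label first, then take the next <strong> after it.
label = soup.find("strong", string="METAR text:")
if label is None:
    metar = None
else:
    value = label.find_next("strong")
    metar = value.get_text(strip=True) if value is not None else None

print(metar)
```

If the page changes and the label disappears, metar is simply None instead of the script crashing partway through the chain.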