Python BeautifulSoup doesn't work on URL

I'm happy to join Stack Overflow :) This is the first time I couldn't find an answer to my problem :)

I would like to scrape the "meta description" from a list of URLs (stored in a SQL database).

When I start my script, it gets "killed" without any error. It gets killed while reading the 11th URL.

I ran some tests and identified the URL that triggers it: http://www.les-calories.com/famille-4.html

So I made this test, reducing my code to the minimum:

# encoding=utf8 
from bs4 import BeautifulSoup
import urllib
html = urllib.urlopen("http://www.les-calories.com/famille-4.html").read()
soup = BeautifulSoup(html)

And this code gets "killed" by the shell.

I don't understand why...

Thank you for your help :)

It could be that you haven't specified the parser, in which case do the following:

soup = BeautifulSoup(html, "html.parser")
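
Since the goal in the question is the meta description, here is a minimal sketch of pulling it out once the parser is named explicitly. The `html` string and the `meta_description` helper are made up for illustration; they are not from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page (not the real site content).
html = """<html><head>
<title>Example</title>
<meta name="description" content="Calorie tables by food family.">
</head><body><p>Hello</p></body></html>"""


def meta_description(markup):
    """Return the content of <meta name="description">, or None if absent."""
    # Name the parser explicitly so behaviour does not depend on
    # which optional parsers happen to be installed.
    soup = BeautifulSoup(markup, "html.parser")
    # "name" is also the first positional argument of find(), so the
    # HTML attribute called "name" has to be passed through attrs.
    tag = soup.find("meta", attrs={"name": "description"})
    return tag.get("content") if tag else None


print(meta_description(html))
```

Returning `None` for pages without the tag lets a loop over many URLs skip them instead of raising.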

However, I think it is more likely that the HTML page simply contained too much data. What I'd do is use the python-requests package and set stream to True in the GET request, like so:

>>> import requests
>>> resp = requests.get("http://www.les-calories.com/famille-4.html", stream=True)
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(resp.text, "html.parser")
>>> soup.find("a")
<a href="http://www.fitadium.com/79-seche-et-definition-musculaire" target="_blank"><img border="0" height="60px" src="http://www.les-calories.com/images/234x60_pack-minceur-brule-graisses.gif" width="234px"/></a>
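
One thing to keep in mind: with stream=True, accessing resp.text still reads the whole body into memory. If the page size really is the problem, a rough sketch like the one below caps how much is downloaded by iterating over the body in chunks. The names `read_capped`, `fetch_capped` and `MAX_BYTES`, as well as the 2 MiB limit and the timeout, are assumptions for illustration.

```python
import requests

MAX_BYTES = 2 * 1024 * 1024  # hypothetical 2 MiB cap; tune for your pages


def read_capped(chunks, max_bytes=MAX_BYTES):
    """Join byte chunks, stopping once max_bytes have been collected."""
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
        if len(buf) >= max_bytes:
            break  # stop pulling further chunks from the source
    return bytes(buf[:max_bytes])


def fetch_capped(url, max_bytes=MAX_BYTES, timeout=10):
    """Download at most max_bytes of the response body as bytes."""
    # stream=True defers the body download until we iterate over it.
    with requests.get(url, stream=True, timeout=timeout) as resp:
        resp.raise_for_status()
        return read_capped(resp.iter_content(chunk_size=8192), max_bytes)
```

The bytes returned by `fetch_capped(url)` can be passed straight to `BeautifulSoup(..., "html.parser")`. A meta description lives in the page head, so a truncated body is usually still enough for that purpose.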
