Python BeautifulSoup doesn't work on a URL

Question

I'm happy to join Stack Overflow :) This is the first time I haven't been able to find an answer to my problem :)

I would like to scrape the "meta description" from a list of URLs (stored in a SQL database).

When I start my script, it gets "killed" without any error. It gets killed while reading the 11th URL.

I ran some tests and identified the URL responsible: "http://www.les-calories.com/famille-4.html"

So I made this test, reducing my code to the minimum:

# encoding=utf8 
from bs4 import BeautifulSoup
import urllib
html = urllib.urlopen("http://www.les-calories.com/famille-4.html").read()
soup = BeautifulSoup(html)

And this code gets "killed" by the shell.

I don't understand why...

Thank you for your help :)

Answer 1

It could be that you haven't specified a parser, in which case do the following:

soup = BeautifulSoup(html, "html.parser")

However, I think it is more likely that the HTML page simply contained too much data. What I'd do is use the python-requests package and set stream to True in the GET request, like so:

>>> import requests
>>> resp = requests.get("http://www.les-calories.com/famille-4.html", stream=True)
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(resp.text, "html.parser")
>>> soup.find("a")
<a href="http://www.fitadium.com/79-seche-et-definition-musculaire" target="_blank"><img border="0" height="60px" src="http://www.les-calories.com/images/234x60_pack-minceur-brule-graisses.gif" width="234px"/></a>
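
One caveat about the snippet above: accessing resp.text still reads the whole response body into memory, so stream=True alone does not limit how much is downloaded. If the page size really is the problem (that part is a guess, not something confirmed here), a rough sketch like the following might help. It reads the body in chunks and stops at a size cap before parsing. The helper names read_capped and fetch_meta_description, the 1 MB default cap, and the 10 second timeout are all assumptions made for illustration, and the code is written for Python 3.

```python
def read_capped(chunks, max_bytes):
    """Join byte chunks, stopping once max_bytes have been collected.

    Hypothetical helper; max_bytes is assumed to be non-negative.
    """
    buf = bytearray()
    for chunk in chunks:
        # Take only as much of this chunk as still fits under the cap.
        buf.extend(chunk[: max_bytes - len(buf)])
        if len(buf) >= max_bytes:
            break
    return bytes(buf)


def fetch_meta_description(url, max_bytes=1000000, timeout=10):
    """Return the meta description of a page, or None if there is none.

    Hypothetical helper; the default cap and timeout are arbitrary choices.
    """
    # Third-party imports stay local so read_capped works without them.
    import requests
    from bs4 import BeautifulSoup

    # With stream=True the body is only downloaded as iter_content is consumed.
    with requests.get(url, stream=True, timeout=timeout) as resp:
        resp.raise_for_status()
        html = read_capped(resp.iter_content(chunk_size=8192), max_bytes)

    # A truncated document is still parsed; html.parser tolerates unclosed tags.
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("meta", attrs={"name": "description"})
    return tag.get("content") if tag else None
```

You would then call fetch_meta_description("http://www.les-calories.com/famille-4.html") for each URL from your database. Since the meta description lives in the page's head, a cap well below the full page size is usually enough, though the right value for your URLs is something to check yourself.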

Answer 1
1 Accepted 2016-04-29 11:11:14
