Why does this pickle reach maximum recursion depth without recursion?
Here is my code. It contains no recursion, yet it hits the maximum recursion depth on the very first pickle...
Code:
#!/usr/bin/env python
from bs4 import BeautifulSoup
from urllib2 import urlopen
import pickle
# open page and return soup list
def get_page_startups(page_url):
    html = urlopen(page_url).read()
    soup = BeautifulSoup(html, "lxml")
    return soup.find_all("div","startup item")

#
# Get certain text from startup soup
#
def get_name(startup):
    return startup.find("a", "profile").string

def get_website(startup):
    return startup.find("a", "visit")["href"]

def get_status(startup):
    return startup.find("p","status").strong.string[8:]

def get_twitter(startup):
    return startup.find("a", "comment").string

def get_high_concept_pitch(startup):
    return startup.find("div","headline").find_all("em")[1].string

def get_elevator_pitch(startup):
    startup_soup = BeautifulSoup(urlopen("http://startupli.st" + startup.find("a","profile")["href"]).read(),"lxml")
    return startup_soup.find("p", "desc").string.rstrip().lstrip()

def get_tags(startup):
    return startup.find("p","tags").string

def get_blog(startup):
    try:
        return startup.find("a","visit blog")["href"]
    except TypeError:
        return None

def get_facebook(startup):
    try:
        return startup.find("a","visit facebook")["href"]
    except TypeError:
        return None

def get_angellist(startup):
    try:
        return startup.find("a","visit angellist")["href"]
    except TypeError:
        return None

def get_linkedin(startup):
    try:
        return startup.find("a","visit linkedin")["href"]
    except TypeError:
        return None

def get_crunchbase(startup):
    try:
        return startup.find("a","visit crunchbase")["href"]
    except TypeError:
        return None

# site to scrape
BASE_URL = "http://startupli.st/startups/latest/"
# scrape all pages
for page_no in xrange(1,142):
    startups = get_page_startups(BASE_URL + str(page_no))

    # search soup and pickle data
    for i, startup in enumerate(startups):
        s = {}
        s['name'] = get_name(startup)
        s['website'] = get_website(startup)
        s['status'] = get_status(startup)
        s['high_concept_pitch'] = get_high_concept_pitch(startup)
        s['elevator_pitch'] = get_elevator_pitch(startup)
        s['tags'] = get_tags(startup)
        s['twitter'] = get_twitter(startup)
        s['facebook'] = get_facebook(startup)
        s['blog'] = get_blog(startup)
        s['angellist'] = get_angellist(startup)
        s['linkedin'] = get_linkedin(startup)
        s['crunchbase'] = get_crunchbase(startup)

        f = open(str(i)+".pkl", "wb")
        pickle.dump(s,f)
        f.close()

    print "Done " + str(page_no)

This is the content of 0.pkl after the exception is raised:
http://pastebin.com/DVS1GKzz (a thousand lines long!)
The pickle contains some HTML from BASE_URL... but I never pickled any HTML strings...

The BeautifulSoup .string attribute is not actually a string:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<div>Foo</div>')
>>> soup.find('div').string
u'Foo'
>>> type(soup.find('div').string)
bs4.element.NavigableString
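A NavigableString keeps references to its parent and neighbouring elements, so pickling one pulls in the parse tree it belongs to. Try using str(soup.find('div').string) instead and see if it helps (on Python 2, unicode(...) is the safer choice if the text may contain non-ASCII characters). Also, I don't think pickle is really the best format here. JSON is much easier in this case.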
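To illustrate why the conversion matters without needing BeautifulSoup installed, here is a minimal sketch written for Python 3. LinkedText and its neighbour attribute are hypothetical stand-ins for NavigableString and its tree links, not part of bs4. It shows that pickling a str subclass also pickles whatever the instance points at, while str() yields a plain string with nothing attached.

```python
import pickle


class LinkedText(str):
    """Hypothetical stand-in for bs4's NavigableString: a str subclass
    whose instances also point at a neighbouring node of the tree."""
    neighbour = None


# Build a tiny "tree": the name we want, linked to unrelated page markup.
name = LinkedText("Foo")
name.neighbour = LinkedText("<div class='startup item'>rest of the page</div>")

linked_bytes = pickle.dumps(name)      # also follows name.neighbour
plain_bytes = pickle.dumps(str(name))  # exact str, no attributes to follow

print(b"rest of the page" in linked_bytes)  # True: the markup came along
print(b"rest of the page" in plain_bytes)   # False: only "Foo" was stored
print(type(str(name)) is str)               # True
```

In a real parse tree the chain of linked nodes is as long as the page, which is consistent with both the HTML showing up in 0.pkl and the depth pickle reaches while following it.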

Most likely pickle recurses internally, and the data you are trying to serialize is large. You can try raising the recursion limit:
import sys
sys.setrecursionlimit(10000)
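However, this is not recommended for any production-ready application, because it can mask the real problem. It can still help expose the issue while debugging.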
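If the limit is raised only for debugging, it may be safer to scope the change so it cannot leak into the rest of the program. The following is a sketch for Python 3; recursion_limit and depth are hypothetical helper names introduced here, not part of the standard library.

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily set the interpreter's recursion limit, then restore it."""
    previous = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        yield
    finally:
        sys.setrecursionlimit(previous)


def depth(n):
    """Plain recursive function used only to exercise the limit."""
    return 0 if n == 0 else 1 + depth(n - 1)


before = sys.getrecursionlimit()
with recursion_limit(before + 5000):
    reached = depth(2000)  # deeper than the usual default limit of 1000
after = sys.getrecursionlimit()

print(reached, before == after)
```

The same with block could wrap a single pickle.dump call during an investigation, leaving the default limit in force everywhere else.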

Pickle cannot handle BeautifulSoup nodes. Similar questions and some workarounds:
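One workaround along these lines is to reduce every scraped value to a built-in type before serializing, and to write JSON rather than a pickle. This is only a sketch for Python 3: LinkedText stands in for a parser node, and to_plain is a hypothetical helper, neither of which comes from bs4.

```python
import json


class LinkedText(str):
    """Hypothetical stand-in for a parser node that subclasses str."""
    parent = None


def to_plain(value):
    """Return an exact str for any str subclass; pass None through."""
    return None if value is None else str(value)


name = LinkedText("Foo")
name.parent = LinkedText("<html>whole page</html>")

# Mirrors the question's dict: missing links are stored as None.
record = {"name": to_plain(name), "blog": to_plain(None)}

encoded = json.dumps(record, sort_keys=True)
print(encoded)  # {"blog": null, "name": "Foo"}

restored = json.loads(encoded)
```

Because every value is now a plain str or None, the record no longer carries any reference back to the parsed page.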