Problem with Crawling Mechanism of Twitter Python Crawler
Here is a small snippet of code from my Twitter crawler mechanism:
from BeautifulSoup import BeautifulSoup
import re
import urllib2

url = 'http://mobile.twitter.com/NYTimesKrugman'

def gettweets(soup):
    tags = soup.findAll('div', {'class' : "list-tweet"}) # to obtain tweet of a follower
    for tag in tags:
        print tag.renderContents()
        print('\n\n')

def are_more_tweets(soup): # to check whether there is more than one page on mobile twitter
    links = soup.findAll('a', {'href': True}, {id: 'more_link'})
    for link in links:
        b = link.renderContents()
        test_b = str(b)
        if test_b.find('more'):
            return True
        else:
            return False

def getnewlink(soup): # to get the link to go to the next page of tweets on twitter
    links = soup.findAll('a', {'href': True}, {id : 'more_link'})
    for link in links:
        b = link.renderContents()
        if str(b) == 'more':
            c = link['href']
            d = 'http://mobile.twitter.com' + c
            return d

def checkforstamp(soup): # the parser scans a webpage to check if any of the tweets are older than 3 months
    times = soup.findAll('a', {'href': True}, {'class': 'status_link'})
    for time in times:
        stamp = time.renderContents()
        test_stamp = str(stamp)
        if test_stamp == '3 months ago':
            print test_stamp
            return True
        else:
            return False

response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
gettweets(soup)
stamp = checkforstamp(soup)
tweets = are_more_tweets(soup)
print 'stamp' + str(stamp)
print 'tweets' + str(tweets)
while (stamp is False) and (tweets is True):
    b = getnewlink(soup)
    print b
    red = urllib2.urlopen(b)
    html = red.read()
    soup = BeautifulSoup(html)
    gettweets(soup)
    stamp = checkforstamp(soup)
    tweets = are_more_tweets(soup)
print 'done'
The problem is that once my Twitter crawler hits tweets that are roughly 3 months old, I want it to stop going to the next page for that user. However, it doesn't seem to do that: it just keeps looking for the next page of tweets. I think this is because checkforstamp keeps evaluating to False. Does anyone have suggestions on how to modify the code so that the crawler only keeps fetching the next page of tweets as long as there are more tweets (verified by the are_more_tweets mechanism) and it hasn't yet reached tweets that are 3 months old? Thanks!
Edit: see below:
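As a side note on are_more_tweets: str.find returns an index, not a boolean, so `if test_b.find('more'):` is truthy for every position except 0, including -1 when the substring is absent. A minimal sketch of the pitfall, using made-up strings rather than real page content:

```python
# str.find returns the index of the first match, or -1 when absent.
# Used directly as a condition, 0 is falsy and -1 is truthy,
# so the test is wrong in both directions.
print('more tweets'.find('more'))   # 0  -> falsy, although 'more' is present
print('no such link'.find('more'))  # -1 -> truthy, although 'more' is absent

# The 'in' operator expresses the intended membership test directly:
print('more' in 'more tweets')   # True
print('more' in 'no such link')  # False
```
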
from BeautifulSoup import BeautifulSoup
import re
import urllib

url = 'http://mobile.twitter.com/cleversallie'
output = open(r'C:\Python28\testrecursion.txt', 'a')

def gettweets(soup):
    tags = soup.findAll('div', {'class' : "list-tweet"}) # to obtain tweet of a follower
    for tag in tags:
        a = tag.renderContents()
        b = str(a)
        print(b)
        print('\n\n')

def are_more_tweets(soup): # to check whether there is more than one page on mobile twitter
    links = soup.findAll('a', {'href': True}, {id: 'more_link'})
    for link in links:
        b = link.renderContents()
        test_b = str(b)
        if test_b.find('more'):
            return True
        else:
            return False

def getnewlink(soup): # to get the link to go to the next page of tweets on twitter
    links = soup.findAll('a', {'href': True}, {id : 'more_link'})
    for link in links:
        b = link.renderContents()
        if str(b) == 'more':
            c = link['href']
            d = 'http://mobile.twitter.com' + c
            return d

def checkforstamp(soup): # the parser scans a webpage to check if any of the tweets are older than 3 months
    times = soup.findAll('a', {'href': True}, {'class': 'status_link'})
    for time in times:
        stamp = time.renderContents()
        test_stamp = str(stamp)
        if not (test_stamp[0]) in '0123456789':
            continue
        if test_stamp == '3 months ago':
            print test_stamp
            return True
        else:
            return False

response = urllib.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
gettweets(soup)
stamp = checkforstamp(soup)
tweets = are_more_tweets(soup)
while (not stamp) and (tweets):
    b = getnewlink(soup)
    print b
    red = urllib.urlopen(b)
    html = red.read()
    soup = BeautifulSoup(html)
    gettweets(soup)
    stamp = checkforstamp(soup)
    tweets = are_more_tweets(soup)
print 'done'
Your soup.findAll() is picking up image tags among the links that match your pattern (an href attribute and class status_link). Instead of always returning on the first link, try:
for time in times:
    stamp = time.renderContents()
    test_stamp = str(stamp)
    print test_stamp
    if not test_stamp[0] in '0123456789':
        continue
    if test_stamp == '3 months ago':
        return True
    else:
        return False
If a link doesn't start with a digit it will be skipped, so you may actually get to the right link. Keep the print statement in there so you can see whether you're hitting any other links that start with a digit and also need to be filtered out.
Edit: What you were doing was always returning on the first item in times. I changed it so it ignores any link that doesn't start with a digit.
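The early-return pattern can be reproduced in isolation. In this sketch, first_only mirrors the original checkforstamp (the else branch returns on the very first item, so later items are never examined), while scan_all mirrors the intended behavior; both helper names are made up for illustration:

```python
def first_only(items, target):
    # Mirrors the original bug: the else branch fires on iteration one,
    # so later items are never examined.
    for item in items:
        if item == target:
            return True
        else:
            return False

def scan_all(items, target):
    # Only concludes False after every item has been examined.
    for item in items:
        if item == target:
            return True
    return False

timestamps = ['2 hours ago', '3 months ago']
print(first_only(timestamps, '3 months ago'))  # False - stopped at item one
print(scan_all(timestamps, '3 months ago'))    # True
```
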
However, that will cause it to return None if it never finds a link that starts with a digit. This would work fine, except you changed while not stamp and tweets to while stamp is False and tweets is True. Change it back to while not stamp and tweets, which correctly treats None and False the same, and it should work.
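The distinction comes down to Python's truth-value rules: `not` treats both None and False as false, while `is` compares identity with exactly one object. A quick demonstration:

```python
stamp = None  # what checkforstamp returns when no link starts with a digit

# Identity distinguishes None from False:
print(stamp is False)  # False - None is not the False object
# Truthiness treats them the same:
print(not stamp)       # True  - None is falsy, just like False

# So the two loop-guard spellings disagree when stamp is None:
print((stamp is False) and True)  # False - loop would stop too early
print((not stamp) and True)       # True  - loop keeps going, as intended
```
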