Problem with Crawling Mechanism of Twitter Python Crawler
Below is a small snippet of code I have for my Twitter crawler mechanism:
from BeautifulSoup import BeautifulSoup
import re
import urllib2

url = 'http://mobile.twitter.com/NYTimesKrugman'

def gettweets(soup):
    tags = soup.findAll('div', {'class': "list-tweet"})  # to obtain tweets of a follower
    for tag in tags:
        print tag.renderContents()
        print('\n\n')

def are_more_tweets(soup):  # to check whether there is more than one page on mobile twitter
    links = soup.findAll('a', {'href': True}, {id: 'more_link'})
    for link in links:
        b = link.renderContents()
        test_b = str(b)
        if test_b.find('more'):
            return True
        else:
            return False

def getnewlink(soup):  # to get the link to go to the next page of tweets on twitter
    links = soup.findAll('a', {'href': True}, {id: 'more_link'})
    for link in links:
        b = link.renderContents()
        if str(b) == 'more':
            c = link['href']
            d = 'http://mobile.twitter.com' + c
            return d

def checkforstamp(soup):  # scan a page to check whether any of the tweets are older than 3 months
    times = soup.findAll('a', {'href': True}, {'class': 'status_link'})
    for time in times:
        stamp = time.renderContents()
        test_stamp = str(stamp)
        if test_stamp == '3 months ago':
            print test_stamp
            return True
        else:
            return False

response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
gettweets(soup)
stamp = checkforstamp(soup)
tweets = are_more_tweets(soup)
print 'stamp' + str(stamp)
print 'tweets' + str(tweets)
while (stamp is False) and (tweets is True):
    b = getnewlink(soup)
    print b
    red = urllib2.urlopen(b)
    html = red.read()
    soup = BeautifulSoup(html)
    gettweets(soup)
    stamp = checkforstamp(soup)
    tweets = are_more_tweets(soup)
print 'done'
The problem is that once my Twitter crawler hits about three months of tweets, I would like it to stop going to the next page of a user. However, it does not appear to be doing that; it seems to keep searching for the next page of tweets. I believe this is because checkforstamp keeps evaluating to False. Does anyone have any suggestions on how I can modify the code so that the crawler keeps looking for the next page of tweets as long as there are more tweets (verified by the are_more_tweets mechanism) and it hasn't yet hit three months of tweets? Thanks!
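(As a side note, the truthiness test in are_more_tweets may contribute to the endless paging: str.find() returns -1, a truthy value, when the substring is absent, so that function can report more tweets even when there are none. A minimal sketch, in Python 3 syntax with made-up strings, of why a membership test is safer:)

```python
# str.find() returns -1 when the substring is absent; -1 is truthy,
# so a bare "if text.find('more'):" succeeds even with no match.
text_without_more = 'Older tweets'
print(text_without_more.find('more'))        # -1
print(bool(text_without_more.find('more')))  # True -- the buggy branch runs

# Safer: the "in" operator yields a plain boolean.
print('more' in text_without_more)           # False
print('more' in 'Load more tweets')          # True
```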
EDIT - Please see below:
from BeautifulSoup import BeautifulSoup
import re
import urllib

url = 'http://mobile.twitter.com/cleversallie'
output = open(r'C:\Python28\testrecursion.txt', 'a')

def gettweets(soup):
    tags = soup.findAll('div', {'class': "list-tweet"})  # to obtain tweets of a follower
    for tag in tags:
        a = tag.renderContents()
        b = str(a)
        print(b)
        print('\n\n')

def are_more_tweets(soup):  # to check whether there is more than one page on mobile twitter
    links = soup.findAll('a', {'href': True}, {id: 'more_link'})
    for link in links:
        b = link.renderContents()
        test_b = str(b)
        if test_b.find('more'):
            return True
        else:
            return False

def getnewlink(soup):  # to get the link to go to the next page of tweets on twitter
    links = soup.findAll('a', {'href': True}, {id: 'more_link'})
    for link in links:
        b = link.renderContents()
        if str(b) == 'more':
            c = link['href']
            d = 'http://mobile.twitter.com' + c
            return d

def checkforstamp(soup):  # scan a page to check whether any of the tweets are older than 3 months
    times = soup.findAll('a', {'href': True}, {'class': 'status_link'})
    for time in times:
        stamp = time.renderContents()
        test_stamp = str(stamp)
        if not (test_stamp[0]) in '0123456789':
            continue
        if test_stamp == '3 months ago':
            print test_stamp
            return True
        else:
            return False

response = urllib.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
gettweets(soup)
stamp = checkforstamp(soup)
tweets = are_more_tweets(soup)
while (not stamp) and (tweets):
    b = getnewlink(soup)
    print b
    red = urllib.urlopen(b)
    html = red.read()
    soup = BeautifulSoup(html)
    gettweets(soup)
    stamp = checkforstamp(soup)
    tweets = are_more_tweets(soup)
print 'done'
Your soup.findAll() is picking up an image tag inside a link that matches your pattern (it has an href attribute and class status_link). Instead of always returning on the very first link, try:
for time in times:
    stamp = time.renderContents()
    test_stamp = str(stamp)
    print test_stamp
    if not test_stamp[0] in '0123456789':
        continue
    if test_stamp == '3 months ago':
        return True
    else:
        return False
This will skip any link whose text doesn't start with a number, so you might actually reach the right link. Keep the print statement in there so you can see whether you hit some other kind of link that starts with a number and also needs to be filtered out.
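To see the skip-and-check logic in isolation, here is a small self-contained sketch (Python 3 syntax, with a hypothetical list of strings standing in for the rendered link contents):

```python
def found_three_month_stamp(rendered_links):
    # Skip links whose text doesn't start with a digit (image tags,
    # usernames, etc.); decide on the first timestamp-style link.
    for text in rendered_links:
        if not text[:1].isdigit():
            continue
        return text == '3 months ago'
    # Falls through and implicitly returns None if no numeric link is found.

# Hypothetical page contents: an image link first, then the timestamp.
print(found_three_month_stamp(['<img src="...">', '2 months ago']))  # False
print(found_three_month_stamp(['<img src="...">', '3 months ago']))  # True
print(found_three_month_stamp(['<img src="...">']))                  # None
```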
Edit: What you were doing was always returning on the very first item in times. I changed it so that it ignores any link that doesn't start with a number. However, this causes it to return None if it finds no link starting with a number. That would work fine, except you changed while not stamp and tweets to while stamp is False and tweets is True. Change it back to while not stamp and tweets so that None and False are correctly treated the same, and it should work.
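The None/False distinction can be checked directly (Python 3 syntax): `is False` is an identity test that only the literal False passes, while `not` applies ordinary truthiness and treats both the same:

```python
stamp = None  # what checkforstamp returns when no timestamp link is found

# Identity test: None is not the object False, so the loop exits early.
print(stamp is False)  # False -- loop condition fails even with more pages
# Truthiness test: None and False both count as false.
print(not stamp)       # True  -- loop keeps paging as intended
print(not False)       # True
```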