
Problem with Crawling Mechanism of Twitter Python Crawler

Below is a small snippet of code for my Twitter crawler mechanism:

from BeautifulSoup import BeautifulSoup
import re
import urllib2

url = 'http://mobile.twitter.com/NYTimesKrugman'

def gettweets(soup):
    tags = soup.findAll('div', {'class' : "list-tweet"})#to obtain tweet of a follower
    for tag in tags: 
        print tag.renderContents()
        print ('\n\n')

def are_more_tweets(soup): # to check whether there is more than one page on mobile twitter
    links = soup.findAll('a', {'href': True}, {id: 'more_link'})
    for link in links:
        b = link.renderContents()
        test_b = str(b)
        if test_b.find('more'):
            return True
        else:
            return False

def getnewlink(soup): #to get the link to go to the next page of tweets on twitter 
    links = soup.findAll('a', {'href': True}, {id : 'more_link'})
    for link in links:
        b = link.renderContents()
        if str(b) == 'more':
            c = link['href']
            d = 'http://mobile.twitter.com' +c
            return d

def checkforstamp(soup): # the parser scans a webpage to check if any of the tweets are older than 3 months
    times = soup.findAll('a', {'href': True}, {'class': 'status_link'})
    for time in times:
        stamp = time.renderContents()
        test_stamp = str(stamp)
        if test_stamp == '3 months ago':  
            print test_stamp
            return True
        else:
            return False


response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
gettweets(soup)
stamp = checkforstamp(soup)
tweets = are_more_tweets(soup)
print 'stamp' + str(stamp)
print 'tweets' +str (tweets)
while (stamp is False) and (tweets is True): 
    b = getnewlink(soup)
    print b
    red = urllib2.urlopen(b)
    html = red.read()
    soup = BeautifulSoup(html)
    gettweets(soup)
    stamp = checkforstamp(soup)
    tweets = are_more_tweets(soup)
print 'done' 

The problem is, after my Twitter crawler hits tweets about 3 months old, I would like it to stop going to the next page of a user's timeline. However, it does not appear to be doing that; it seems to keep searching for the next page of tweets. I believe this is because checkforstamp keeps evaluating to False. Does anyone have any suggestions on how I can modify the code so that the crawler keeps looking for the next page of tweets as long as there are more tweets (verified by the are_more_tweets mechanism) and it hasn't yet hit tweets 3 months old? Thanks!
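(As a side note on are_more_tweets: str.find returns an index, or -1, rather than a boolean, so testing its result for truth can invert the intended logic. A quick illustration on plain strings:)

```python
# str.find returns the index of the first match, or -1 when absent.
# 0 (a match at the start) is falsy and -1 (no match) is truthy, so a
# bare truth test on the result can say the opposite of what you mean.
print('more'.find('more'))     # 0  -> falsy, although 'more' is present
print('nothing'.find('more'))  # -1 -> truthy, although 'more' is absent
print('more' in 'nothing')     # False; the `in` operator is the safer test
```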

EDIT - Please see below:

from BeautifulSoup import BeautifulSoup
import re
import urllib

url = 'http://mobile.twitter.com/cleversallie'
output = open(r'C:\Python28\testrecursion.txt', 'a') 

def gettweets(soup):
    tags = soup.findAll('div', {'class' : "list-tweet"})#to obtain tweet of a follower
    for tag in tags: 
        a = tag.renderContents()
        b = str (a)
        print(b)
        print('\n\n')

def are_more_tweets(soup):#to check whether there is more than one page on mobile twitter 
    links = soup.findAll('a', {'href': True}, {id: 'more_link'})
    for link in links:
        b = link.renderContents()
        test_b = str(b)
        if test_b.find('more'):
            return True
        else:
            return False

def getnewlink(soup): #to get the link to go to the next page of tweets on twitter 
    links = soup.findAll('a', {'href': True}, {id : 'more_link'})
    for link in links:
        b = link.renderContents()
        if str(b) == 'more':
            c = link['href']
            d = 'http://mobile.twitter.com' +c
            return d

def checkforstamp(soup): # the parser scans a webpage to check if any of the tweets are older than 3 months
    times = soup.findAll('a', {'href': True}, {'class': 'status_link'})
    for time in times:
        stamp = time.renderContents()
        test_stamp = str(stamp)
        if not (test_stamp[0]) in '0123456789':
            continue
        if test_stamp == '3 months ago':
            print test_stamp
            return True
        else:
            return False


response = urllib.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
gettweets(soup)
stamp = checkforstamp(soup)
tweets = are_more_tweets(soup)
while (not stamp) and (tweets): 
    b = getnewlink(soup)
    print b
    red = urllib.urlopen(b)
    html = red.read()
    soup = BeautifulSoup(html)
    gettweets(soup)
    stamp = checkforstamp(soup)
    tweets = are_more_tweets(soup)
print 'done' 

Your soup.findAll() is picking up an image tag inside a link that matches your pattern (has an href attribute and class status_link).

Instead of always returning on the very first link, try:

for time in times:
    stamp = time.renderContents()
    test_stamp = str(stamp)
    print test_stamp
    if not test_stamp[0] in '0123456789':
        continue
    if test_stamp == '3 months ago':  
        return True
    else:
        return False

This will skip the link if its text doesn't start with a digit, so you might actually get to the right link. Keep that print statement in there so you can see whether you hit some other kind of link that starts with a digit and also needs to be filtered out.
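The effect of the digit check can be seen on plain strings, without any HTML parsing (the link texts and the helper name below are made up for illustration):

```python
def first_timestamp(link_texts):
    # Return the first text that starts with a digit, mimicking the
    # skip-non-timestamp logic above; returns None if nothing qualifies.
    for text in link_texts:
        if not text[:1] in '0123456789':
            continue  # not a relative timestamp like '3 months ago'
        return text

print(first_timestamp(['more', '<img src="x.png">', '17 minutes ago']))
# 17 minutes ago
print(first_timestamp(['more']))
# None
```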

Edit: What you were doing before always returned on the very first item in times. I changed it so that it ignores any link whose text does not start with a digit.

However, this causes it to return None if it doesn't find any link starting with a digit. That would work fine, except you changed while not stamp and tweets to while stamp is False and tweets is True. Change it back to while not stamp and tweets; it will then correctly treat None and False the same, and it should work.
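The difference between the two loop conditions comes down to Python's truthiness rules, which you can check directly:

```python
stamp = None  # what checkforstamp returns when no timestamp link is found

# `is` tests object identity, and None is not the object False:
print(stamp is False)  # False -> `while stamp is False ...` exits immediately
# `not` tests truthiness, and both None and False are falsy:
print(not stamp)       # True  -> `while not stamp ...` keeps looping
```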
