Loop problem with a Python Twitter crawler

Question

I'm continuing to write my Twitter crawler and am running into more problems. Take a look at the code below:

from BeautifulSoup import BeautifulSoup
import re
import urllib2

url = 'http://mobile.twitter.com/NYTimesKrugman'

def gettweets(soup):
    tags = soup.findAll('div', {'class' : "list-tweet"})#to obtain tweet of a follower
    for tag in tags: 
        print tag.renderContents()
        print ('\n\n')

def are_more_tweets(soup):#to check whether there is more than one page on mobile 
    links = soup.findAll('a', {'href': True}, {id: 'more_link'})
    for link in links:
        b = link.renderContents()
        test_b = str(b)
        if test_b.find('more'):
            return True
        else:
            return False

def getnewlink(soup): #to get the link to go to the next page of tweets on twitter 
    links = soup.findAll('a', {'href': True}, {id : 'more_link'})
    for link in links:
        b = link.renderContents()
        if str(b) == 'more':
            c = link['href']
            d = 'http://mobile.twitter.com' +c
            return d

def checkforstamp(soup): # the parser scans a webpage to check if any of the tweets are   
    times = soup.findAll('a', {'href': True}, {'class': 'status_link'})
    for time in times:
        stamp = time.renderContents()
        test_stamp = str(stamp)
        if test_stamp.find('month'): 
            return True
        else:
            return False


response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
gettweets(soup)
stamp = checkforstamp(soup)
tweets = are_more_tweets(soup)
print 'stamp' + str(stamp)
print 'tweets' +str (tweets)
while (not stamp) and tweets: 
    b = getnewlink(soup)
    print b
    red = urllib2.urlopen(b)
    html = red.read()
    soup = BeautifulSoup(html)
    gettweets(soup)
    stamp = checkforstamp(soup)
    tweets = are_more_tweets(soup)
print 'done'

The code works as follows, for a single user (NYTimesKrugman):

- I obtain all tweets on a single page (gettweets).
- Provided more tweets exist (are_more_tweets) and I haven't yet obtained a month of tweets (checkforstamp), I get the link to the next page of tweets.
- I go to the next page of tweets (entering the while loop) and continue the process until one of the above conditions is violated.

However, after extensive testing I have determined that the program never actually enters the while loop. This is strange, because as I wrote the code, tweets should be True and stamp should be False. Instead, I'm getting the results below, and I am truly baffled!

<div>
<span>
<strong><a href="http://mobile.twitter.com/nytimeskrugman">NYTimeskrugman</a></strong>
<span class="status">What Would I Have Done? <a rel="nofollow"   href="http://nyti.ms/nHxb8L" target="_blank" class="twitter_external_link">http://nyti.ms/nHxb8L</a></span>
</span>
<div class="list-tweet-status">
<a href="/nytimeskrugman/status/98046724089716739" class="status_link">3 days ago</a>
</div>
<div class="list-tweet-actions">
</div>
</div>




stampTrue
tweetsTrue
done
>>>

If someone could help, that'd be great. Why can't I get more than one page of tweets? Is my parsing in checkforstamp being done incorrectly? Thanks.

Answer 1

Your checkforstamp function returns non-empty, defined strings:

return 'True'

So (not stamp) will always be false.

Change it to return booleans like are_more_tweets does:

return True

and it should be fine.
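To see why a string return value breaks the loop condition, here is a minimal sketch. The two functions are hypothetical stand-ins for checkforstamp, not the code from the question, and the snippet is written for Python 3 rather than the Python 2 used in the question.

```python
def stamp_as_string():
    # Hypothetical stand-in: returns the *string* 'False', not the boolean.
    return 'False'

def stamp_as_bool():
    # Hypothetical stand-in: returns the boolean False.
    return False

tweets = True

# Any non-empty string is truthy, so `not 'False'` is False and the
# loop condition fails before the first iteration.
enters_with_string = (not stamp_as_string()) and tweets
enters_with_bool = (not stamp_as_bool()) and tweets

print(enters_with_string)  # False
print(enters_with_bool)    # True
```

The same while condition, (not stamp) and tweets, behaves differently only because of the return type.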

For reference, see the boolean operations documentation:

In the context of Boolean operations, and also when expressions are used by control flow statements, the following values are interpreted as false: False, None, numeric zero of all types, and empty strings and containers (including strings, tuples, lists, dictionaries, sets and frozensets). All other values are interpreted as true.

...

The operator not yields True if its argument is false, False otherwise.

Edit:

Same problem with the if test in checkforstamp. Since find('substr') returns -1 when the substring is not found, str.find('substr') in a boolean context will be True when there is no match, according to the rules above.

That is not the only place in your code where this problem appears. Please review all your tests.
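The behaviour of find in a boolean context can be checked directly. This is a small sketch: '3 days ago' comes from the output in the question, while the other two sample strings are made up for illustration. It is written for Python 3.

```python
# '3 days ago' appears in the question's output; the others are hypothetical.
samples = ['3 days ago', 'about 1 month ago', 'month ago']

results = []
for text in samples:
    pos = text.find('month')          # index of the match, or -1 if absent
    truthy = bool(pos)                # what `if text.find('month'):` sees
    present = 'month' in text         # the substring test that was intended
    results.append((pos, truthy, present))
    print(repr(text), pos, truthy, present)
```

The truthiness of the find result is wrong in two of the three cases: it is True when the substring is missing (-1) and False when the match is at index 0. The in operator gives the intended answer in all three.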

Answer 2

if test_stamp.find('month'):

will evaluate to True if it doesn't find month, because find returns -1 when it doesn't find the substring. It would only evaluate to False here if month was at the beginning of the string, so that its position was 0.

You need

if test_stamp.find('month') != -1:

or just

return test_stamp.find('month') != -1
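One further point, which goes beyond what either answer states: because checkforstamp returns inside the first pass of its for loop, only the first link is ever tested, and the function implicitly returns None when there are no links at all. A possible sketch of a version that looks at every timestamp follows. It assumes the timestamp texts have already been collected into a list of strings; has_month_stamp is a hypothetical name, and the snippet is written for Python 3.

```python
def has_month_stamp(stamps):
    """Return True if any timestamp text mentions 'month'.

    `stamps` is assumed to be a list of strings such as '3 days ago'.
    """
    return any('month' in stamp for stamp in stamps)

print(has_month_stamp(['3 days ago', '5 days ago']))         # False
print(has_month_stamp(['3 days ago', 'about 1 month ago']))  # True
print(has_month_stamp([]))                                   # False
```

Using any() always yields a real boolean, including for an empty list, so the while condition never has to deal with None.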

Answer 1: 1, 2011-08-04 20:54:40
Answer 2: 1, accepted, 2011-08-04 21:18:09