A "hash table" for a Python Twitter crawler

Question

As part of the Python Twitter crawler I'm creating, I am attempting to make a "hash table" of sorts to ensure that I don't crawl any user more than once. It is below. However, I am running into some problems. When I start crawling at the user NYTimesKrugman, I seem to crawl some users more than once. When I start crawling at the user cleversallie (in another, completely independent crawl), I don't crawl any user more than once. Any insight into this behavior would be greatly appreciated!

from BeautifulSoup import BeautifulSoup
import re
import urllib2
import twitter

start_follower = "cleversallie" 
depth = 3

U = list()

api = twitter.Api()

def add_to_U(user):
   U.append(user)

def user_crawled(user):
   L = len(L)
   for x in (0, L):
      a = L[x]
      if a != user:
         return False
      else:
         return True

def turn_to_names(users):
    names = list()
    for u in users:
       x = u.screen_name
       names.append(x)
    return names

def test_users(users):
   new = list()
   for u in users:
      if (user_crawled):
         new.append(u)
   return new

def crawl(follower,in_depth): #main method of sorts
   if in_depth > 0:
      add_to_U(follower)
      users = api.GetFriends(follower)
      names = turn_to_names(users)
      select_users = test_users(names)
      for u in select_users[0:5]:
         crawl(u, in_depth - 1)

crawl(start_follower, depth)
for u in U:
   print u
print("Program done.")

EDIT: Based on your suggestions (thank you all very much :) I have rewritten the code as follows:

import re
import urllib2
import twitter

start_follower = "NYTimesKrugman"
depth = 4

searched = set()

api = twitter.Api()

def crawl(follower, in_depth):
    if in_depth > 0:
        searched.add(follower)
        users = api.GetFriends(follower)
        names = set([str(u.screen_name) for u in users])
        names -= searched
        for name in list(names)[0:5]:
            crawl(name, in_depth-1) 

crawl(start_follower, depth)
for x in searched:
    print x
print "Program is completed."

Answer 1

For starters, the code sample you've given just plain doesn't work, but I would guess your problem has something to do with not actually making a hash table (a dictionary? a set?).

You call L = len(L), but I cannot see L defined anywhere else. You then have a loop,

for x in (0, L):
   a = L[x]
   if a != user:
      return False
   else:
      return True

which will actually just execute twice, once with x = 0 and once with x = L, where L is len(L), because (0, L) is a two-element tuple rather than a range. Needless to say, when you attempt to index into L the loop will fail. In practice it won't even get that far: L is not defined anywhere, and the if-else returns on the first iteration either way.
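A minimal sketch of the difference, using a made-up list of names (the contents of `U` here are illustrative, not taken from the crawler): iterating over the tuple `(0, n)` visits only the two values 0 and n, whereas `range(0, n)` visits every valid index.

```python
# Hypothetical data standing in for the crawler's list of users.
U = ["alice", "bob", "carol"]
n = len(U)

tuple_values = [x for x in (0, n)]        # iterates the 2-tuple itself
range_values = [x for x in range(0, n)]   # iterates every valid index

print(tuple_values)   # [0, 3]
print(range_values)   # [0, 1, 2]

# U[n] is one past the end, so indexing with the tuple's second value fails.
try:
    U[n]
except IndexError as exc:
    print("IndexError:", exc)
```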

What you are most likely looking for is a set: check whether the user is in it, do some work if they're absent, then add the user. This might look like:

first_user = 'cleversallie'
crawled_users = {first_user}  # set literal

def crawl(user, depth, max_depth):
    # look up the friends of the user being crawled, not always first_user
    friends = get_friends(user)
    for friend in friends:
        if friend not in crawled_users and depth < max_depth:
            crawled_users.add(friend)
            crawl(friend, depth + 1, max_depth)

crawl(first_user, 0, 5)

You can fill in the details of what happens in get_friends. I haven't tested this, so pardon any syntax errors, but it should be a strong start for you.
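As a rough, self-contained sketch of the same idea, the version below replaces the Twitter call with a hypothetical in-memory graph. `FAKE_FRIENDS` and this `get_friends` are assumptions made up for illustration and are not part of python-twitter; they only let the visited-set logic run offline.

```python
# Hypothetical friend graph; in the real crawler this would come from the API.
FAKE_FRIENDS = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a"],
    "d": ["b", "e"],
    "e": [],
}

def get_friends(user):
    """Stand-in for the real lookup (an assumption, not a python-twitter call)."""
    return FAKE_FRIENDS.get(user, [])

def crawl(user, depth, max_depth, crawled, order):
    """Depth-limited crawl that visits each user at most once."""
    if depth >= max_depth:
        return
    for friend in get_friends(user):
        if friend not in crawled:
            crawled.add(friend)      # mark before recursing, so cycles stop here
            order.append(friend)
            crawl(friend, depth + 1, max_depth, crawled, order)

crawled_users = {"a"}
visit_order = ["a"]
crawl("a", 0, 5, crawled_users, visit_order)
print(visit_order)
```

Even though the graph contains cycles (a and b list each other), each name should appear only once in `visit_order`, because membership in the set is checked before every recursive call.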

Answer 2

You have a bug where you set L to len(L), not len(U). Also, you have a bug where you return False as soon as the first user does not match, rather than only when every user fails to match. In Python, the same function may be written as either of the following:

def user_crawled(user):
  for a in U:
    if a == user:
      return True

  return False

def user_crawled(user):
  return user in U

The test_users function uses user_crawled as a variable; it does not actually call it. Also, it seems you are doing the inverse of what you intend: you want new to be populated with untested users, not tested ones. This is that function with the errors corrected:

def test_users(users):
   new = list()
   for u in users:
      if not user_crawled(u):
         new.append(u)
   return new
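To illustrate why the original `if (user_crawled):` never filtered anything, here is a small sketch with a placeholder `user_crawled` (its body is an assumption for demonstration only). A bare function name evaluates to the function object, which is always truthy, so the branch runs regardless of what the function would return.

```python
def user_crawled(user):
    """Placeholder check that always reports 'not crawled'."""
    return False

# Referring to the function without calling it tests the function object,
# which is always truthy.
without_call = bool(user_crawled)

# Calling it evaluates the actual result.
with_call = bool(user_crawled("someone"))

print(without_call, with_call)   # True False
```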

Using a generator function, you can further simplify the function (provided you intend to loop over the results):

def test_users(users):
   for u in users:
      if not user_crawled(u):
         yield u

You can also use the filter function:

def test_users(users):
   return filter(lambda u: not user_crawled(u), users)

You're using a list to store users, not a hash-based structure. Python provides sets for when you need a list-like structure that can never have duplicates and requires fast existence tests. Sets can also be subtracted to remove all the elements in one set from the other.
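A brief sketch of those set properties, using made-up names (none of this data comes from the crawler): duplicates collapse when a list is converted to a set, membership is tested with `in`, and subtraction removes everything already seen.

```python
searched = {"alice", "bob"}                          # names already crawled
names = ["bob", "carol", "carol", "dave", "alice"]   # note the duplicate

unique_names = set(names)            # duplicates collapse automatically
to_visit = unique_names - searched   # set difference: drop anything already searched

print(sorted(to_visit))    # ['carol', 'dave']
print("bob" in searched)   # True
```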

Also, your list (U) holds users, but you are matching it against user names. You need to store just the user name of each added user. In addition, you use u to represent a user at one point in the program and a user name at another; you should use more meaningful variable names.

Python's syntactic sugar ends up eliminating the need for all of your functions. This is how I would rewrite the entire program:

import twitter

start_follower = "cleversallie"
MAX_DEPTH = 3

searched = set()

api = twitter.Api()

def crawl(follower, in_depth=MAX_DEPTH):
   if in_depth > 0:
      # follower is a screen name (a string), so store it directly
      searched.add(follower)

      users = api.GetFriends(follower)
      # keep only the screen names, as in the question's turn_to_names
      names = set([u.screen_name for u in users])

      names -= searched
      for name in list(names)[:5]:
         crawl(name, in_depth - 1)

crawl(start_follower)

print "\n".join(searched)
print("Program done.")

Answer 3

Let's start by saying there are lots of errors in this code and a lot of non-Pythonic idioms.

For instance:

def user_crawled(user):
  L = len(U)
  for x in (0, L):
    a = L[x]
    if a != user:
      return False
    else:
      return True

This iterates only once through the loop, because both branches return on the first pass. So you really meant something like the following, adding range() and the ability to check all the users.

def user_crawled(user) :
  L = len(U)
  for x in range(0, L) :
    a = U[x]  # index into the list U, not the integer L
    if a == user :
       return True
  return False

Now of course a slightly more Pythonic way would be to skip the range and just iterate over the list.

def user_crawled(user) :
  for a in U :
    if a == user :
      return True
  return False

Which is nice and simple, but in true Python you would jump on the "in" operator and write:

def user_crawled(user) :
  return user in U

A few more Python thoughts: list comprehensions.

def test_users(users) :
  # keep only the users that have not been crawled yet
  return [u for u in users if not user_crawled(u)]

This could also be applied to turn_to_names(), which is left as an exercise to the reader.
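For reference, one possible shape of that exercise is sketched below. The `User` namedtuple is a hypothetical stand-in for the objects returned by the API; it models only the `screen_name` attribute that the question's code reads.

```python
from collections import namedtuple

# Hypothetical stand-in for the API's user objects (assumption for illustration).
User = namedtuple("User", ["screen_name"])

def turn_to_names(users):
    """Return the screen name of every user, in order."""
    return [u.screen_name for u in users]

users = [User("alice"), User("bob")]
names = turn_to_names(users)
print(names)   # ['alice', 'bob']
```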

Answer 1
3 2011-08-10 23:13:55

Answer 2
3 (accepted) 2011-08-10 23:33:11

Answer 3
1 2011-08-10 23:28:18
