Using a Python web crawler to scrape Twitter accounts

I'm writing this program for my A-Level Computer Science coursework, and I'm trying to get a crawler to scrape all of the users found in a given user's following/followers lists.

The start of the script is as follows:

import requests
# import database as db
from bs4 import BeautifulSoup

debug = True


def getStartNode():  # Get the Twitter profile of the starting node
    global startNodeFollowing  # Declare the nodes vars as global for use in external functions
    global startNodeFollowers
    global startNodeLink
    if not debug:  # If debugging == False, allow the user to enter any starting node Twitter profile
        startNodeLink = input("Enter a link to the starting users Twitter profile\n[URL]: ")[:-1]  # Get profile link, remove the last char from input (space char, needed to enter link in terminal)
    else:  # If debugging == True, have predetermined starting node to save time during development
        startNodeLink = ("https://twitter.com/ckjellberg03")
    startNodeFollowers = (startNodeLink + "/followers")  # Create a new var using the starting node's Twitter profile, append for followers and following URL pages
    startNodeFollowing = (startNodeLink + "/following")

And the crawler is here:

def spider():  # Web Crawler
    getStartNode()
    print("\nUsing:", startNodeLink)

    urlFollowers = startNodeFollowers
    sourceCode = requests.get(urlFollowers)
    plainText = sourceCode.text  # Source code of the URL (urlFollowers) in plain text format
    soup = BeautifulSoup(plainText,'lxml')  # BeautifulSoup object to search through plainText for specific items/classes etc
    for link in soup.findAll('a', {'class': 'css-4rbku5 css-18t94o4 css-1dbjc4n r-1loqt21 r-1wbh5a2 r-dnmrzs r-1ny4l3l'}):  # 'a' is a link in HTML (anchor), class is the Twitter class for a profile
        href = link.get('href')  # 'href' must be passed as a string; link.get(href) raises a NameError
        print(href) # Display everything found (development purposes)

From looking at the page source, I'm pretty sure the class identifier for a user's profile link on a /followers page is "css-4rbku5 css-18t94o4 css-1dbjc4n r-1loqt21 r-1wbh5a2 r-dnmrzs r-1ny4l3l", but printing the results displays nothing.

Any advice to point me in the right direction?

Thanks!

It's pretty difficult to scrape Twitter (trust me, I have tried every way). You can use the Twitter API, but it has limitations (you can't get the names of the followers, only the number). If you want to scrape some information with the Twitter API, you can use this code:

import tweepy
from tweepy import Cursor
from datetime import datetime, timedelta

consumer_key = 'consumer key'
consumer_secret = 'consumer secret'
token = 'token'
token_secret = 'token secret'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(token, token_secret)
api = tweepy.API(auth)

account_list = ['POTUS44']



for target in account_list:
    # Look up the profile and print some basic account statistics
    print("Getting data for " + target)
    item = api.get_user(target)
    print("name: " + item.name)
    print("screen_name: " + item.screen_name)
    print("description: " + item.description)
    print("statuses_count: " + str(item.statuses_count))
    print("friends_count: " + str(item.friends_count))
    print("followers_count: " + str(item.followers_count))

    tweets = item.statuses_count
    account_created_date = item.created_at
    delta = datetime.utcnow() - account_created_date
    account_age_days = delta.days
    print("Account age (in days): " + str(account_age_days))
    if account_age_days > 0:
      print("Average tweets per day: " + "%.2f"%(float(tweets)/float(account_age_days)))

    # Collect hashtags and user mentions from the last 30 days of the timeline
    hashtags = []
    mentions = []
    tweet_count = 0
    end_date = datetime.utcnow() - timedelta(days=30)
    for status in Cursor(api.user_timeline, id=target).items():
      tweet_count += 1
      if hasattr(status, "entities"):
        entities = status.entities
        if "hashtags" in entities:
          for ent in entities["hashtags"]:
            if ent is not None:
              if "text" in ent:
                hashtag = ent["text"]
                if hashtag is not None:
                  hashtags.append(hashtag)
        if "user_mentions" in entities:
          for ent in entities["user_mentions"]:
            if ent is not None:
              if "screen_name" in ent:
                name = ent["screen_name"]
                if name is not None:
                  mentions.append(name)
      if status.created_at < end_date:
        break
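
The timeline loop above fills the hashtags, mentions and tweet_count variables but never reports them. A minimal follow-up sketch (assuming the variables built by the loop above, and summarising only the last account processed) could use the standard library:

from collections import Counter

# Hypothetical summary of what the timeline loop collected above
print("Tweets scanned in the last 30 days: " + str(tweet_count))
print("Most common hashtags: " + str(Counter(hashtags).most_common(5)))
print("Most common mentions: " + str(Counter(mentions).most_common(5)))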

Here is how to do it without the API. Some of the difficulty comes from sending the right browser identifier in the User-Agent header:

import re, requests

headers = { 'User-Agent': 'UCWEB/2.0 (compatible; Googlebot/2.1; +google.com/bot.html)'}


def cleanhtml(raw_html):
  cleanr = re.compile('<.*?>')  # Match any HTML tag (non-greedy)
  cleantext = re.sub(cleanr, '', raw_html)  # Strip the tags, keep the text
  return cleantext

content = ""
for user in ['billgates']:
    content += "============================\n\n"
    content += user + "\n\n"
    content += "============================\n\n"
    url_twitter = 'https://twitter.com/%s' % user
    resp = requests.get(url_twitter, headers=headers)  # Send request
    res = re.findall(r'<p class="TweetTextSize.*?tweet-text.*?>(.*?)</p>', resp.text)  # Pull tweet text out of the profile HTML
    for x in res:
        x = cleanhtml(x)
        x = x.replace("&#39;","'")
        x = x.replace('&quot;','"')
        x = x.replace("&nbsp;"," ")
        content += x 
        content += "\n\n"
        content += "---"
        content += "\n\n"
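
The loop above only accumulates the scraped tweets in the content variable; to keep or inspect the result you still need to output it somewhere, for example (a minimal sketch, writing to a hypothetical tweets.txt file):

print(content)

with open("tweets.txt", "w", encoding="utf-8") as f:  # hypothetical output file
    f.write(content)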
