
Extracting external links from tweets in Python

I wrote this simple program to extract links from a given user's tweets. It extracts the links inside the tweets, but every link I get is shortened under the t.co domain, and some of these links lead to other tweets.

How do I extract links from tweets and make sure that they point to an external site, not to Twitter itself?

I hope my question is clear; this is the best way I can describe it.

Thanks

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
import re

#http://www.tweepy.org/
import tweepy

#Get your Twitter API credentials and enter them here
consumer_key = ""
consumer_secret = ""
access_key = ""
access_secret = ""

#method to print the links in a user's last 200 tweets
def get_tweets(username):

        #http://tweepy.readthedocs.org/en/v3.1.0/getting_started.html#api
        auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
        auth.set_access_token(access_key, access_secret)
        api = tweepy.API(auth)

        #set count to however many tweets you want; twitter only allows 200 at once
        number_of_tweets = 200

        #get tweets
        tweets = api.user_timeline(screen_name = username,count = number_of_tweets)

        for tweet in tweets:
                urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', tweet.text)
                for url in urls:
                        print url


#if we're running this as a script
if __name__ == '__main__':

    #get tweets for username passed at command line
    if len(sys.argv) == 2:
        get_tweets(sys.argv[1])
    else:
        print "Error: enter one username"

    #alternative method: loop through multiple users
    # users = ['user1','user2']
    # for user in users:
    #     get_tweets(user)

I could not post an output sample because it contains shortened links, which the editor does not allow.

In Python 3, you can adapt Greg Filla's answer as follows:

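import re
import urllib.request

for tweet in tweets:
    urls = re.findall(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", tweet.text)
    for url in urls:
        try:
            # Open the shortened URL; the opener follows redirects automatically
            opener = urllib.request.build_opener()
            request = urllib.request.Request(url)
            response = opener.open(request)
            # geturl() returns the final URL after all redirects
            actual_url = response.geturl()
            print(actual_url)
        except Exception:
            # Fall back to the extracted URL if the request fails
            print(url)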
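
The code above prints every resolved URL, including ones that still point back to Twitter. A minimal sketch of one way to keep only external links is to compare the hostname of each resolved URL against a list of Twitter hosts. The helper name is_external_link and the TWITTER_HOSTS tuple are assumptions made for illustration, not part of tweepy or the original answer, and the sample URLs are made up.

```python
from urllib.parse import urlparse

# Hypothetical list of hosts treated as "Twitter itself"; extend as needed.
TWITTER_HOSTS = ("twitter.com", "t.co")

def is_external_link(url, internal_hosts=TWITTER_HOSTS):
    """Return True if the URL's host is neither an internal host nor a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    if not host:
        # No hostname means this is not a usable absolute URL.
        return False
    return not any(host == h or host.endswith("." + h) for h in internal_hosts)

# Made-up examples standing in for URLs returned by response.geturl()
resolved = [
    "https://twitter.com/someuser/status/1",
    "https://mobile.twitter.com/someuser",
    "https://example.com/article",
]
external = [u for u in resolved if is_external_link(u)]
print(external)  # ['https://example.com/article']
```

In the loop above, you would call is_external_link(actual_url) before printing, so that links resolving to other tweets are skipped.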

You need to get the redirected URL. First, add import urllib2 (Python 2), then try the following code:

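for tweet in tweets:
    urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', tweet.text)
    for url in urls:
        try:
            # urlopen follows redirects; geturl() returns the final URL
            res = urllib2.urlopen(url)
            actual_url = res.geturl()
            print actual_url
        except Exception:
            # fall back to the extracted URL if the request fails
            print url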

I use the try..except block because the regex extracted invalid URLs from some of the tweets I tested.
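
One likely source of those invalid URLs is that the character class [$-_@.&+] in the regex also matches punctuation such as commas, periods, and closing parentheses, so a match can pick up the punctuation that follows a link. As a rough sketch, assuming you want to reject bad candidates before making a request, you could trim trailing punctuation and check the parsed result first. The helper name clean_candidate is hypothetical, and the rule that a host must contain a dot is only a heuristic.

```python
from urllib.parse import urlparse

def clean_candidate(candidate):
    """Strip trailing punctuation; return the URL if it looks valid, else None."""
    trimmed = candidate.rstrip(".,;:!?)\"'")
    try:
        parts = urlparse(trimmed)
        host = parts.hostname
    except ValueError:
        # urlparse can raise on malformed input such as an unclosed IPv6 bracket.
        return None
    # Heuristic: require an http(s) scheme and a host containing at least one dot.
    if parts.scheme in ("http", "https") and host and "." in host:
        return trimmed
    return None

# Made-up candidates of the kind the regex might return
for candidate in ["https://t.co/abc123.", "http://example.com/a),", "http://", "https://t"]:
    print(repr(candidate), "->", clean_candidate(candidate))
```

This does not replace the try..except block, since a well-formed URL can still fail to load; it only avoids requests that cannot succeed.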


 