I wrote this simple program to extract links from a given user's tweets. It does extract the links inside the tweets, but every link I get is shortened, with t.co as the domain.
The problem is that these links sometimes lead to other tweets. How do I extract links from tweets and make sure that they point to an external site, not to Twitter itself?
I hope my question is clear because this is the best way I can describe it.
Thanks
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import re
#http://www.tweepy.org/
import tweepy

#Get your Twitter API credentials and enter them here
consumer_key = ""
consumer_secret = ""
access_key = ""
access_secret = ""

#method to get a user's last 200 tweets
def get_tweets(username):
    #http://tweepy.readthedocs.org/en/v3.1.0/getting_started.html#api
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)
    #set count to however many tweets you want; twitter only allows 200 at once
    number_of_tweets = 200
    #get tweets
    tweets = api.user_timeline(screen_name = username,count = number_of_tweets)
    for tweet in tweets:
        urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', tweet.text)
        for url in urls:
            print url

#if we're running this as a script
if __name__ == '__main__':
    #get tweets for username passed at command line
    if len(sys.argv) == 2:
        get_tweets(sys.argv[1])
    else:
        print "Error: enter one username"
    #alternative method: loop through multiple users
    # users = ['user1','user2']
    # for user in users:
    #     get_tweets(user)
I could not post an output sample because it contains shortened links, which the editor does not allow.
In Python 3, you can adapt Greg Filla's answer as follows:
import urllib.request

for tweet in tweets:
    urls = re.findall(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", tweet.text)
    for url in urls:
        try:
            # follow the t.co redirect and read the final URL
            opener = urllib.request.build_opener()
            request = urllib.request.Request(url)
            response = opener.open(request)
            actual_url = response.geturl()
            print(actual_url)
        except Exception:
            # fall back to the shortened URL if it cannot be opened
            print(url)
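Resolving the redirect gives you the final address, but it does not yet drop the links that point back to Twitter, which is what the question asks for. One way to do that is to check the hostname of the resolved URL. The following is a minimal Python 3 sketch; the helper name is_external and the TWITTER_HOSTS set are my own assumptions, not part of tweepy, so adjust the host list to your needs.

```python
from urllib.parse import urlparse

# Hosts treated as "Twitter itself". This set is an assumption; extend it as needed.
TWITTER_HOSTS = {"twitter.com", "t.co"}

def is_external(url):
    """Return True if the URL has a host that is not a Twitter host or subdomain."""
    host = (urlparse(url).hostname or "").lower()
    if not host:
        # No recognizable host, so this is not a usable link
        return False
    return not any(host == h or host.endswith("." + h) for h in TWITTER_HOSTS)
```

In the loop above you would then print actual_url only when is_external(actual_url) returns True.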
You need to get the redirected URL. First, add import urllib2, then try the following code:
for tweet in tweets:
    urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', tweet.text)
    for url in urls:
        try:
            res = urllib2.urlopen(url)
            actual_url = res.geturl()
            print actual_url
        except:
            print url
I added the try/except block because the regex extracted invalid URLs from some of the tweets I tested.
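A bare except also swallows unrelated errors such as KeyboardInterrupt, so it can help to catch only the failures you expect from a bad or unreachable URL. Here is a minimal Python 3 sketch of that idea (not the Python 2 code above); the helper name resolve_url and the 10-second timeout are assumptions of mine.

```python
import urllib.request

def resolve_url(url, timeout=10):
    """Return the final URL after redirects, or the input URL if it cannot be opened."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.geturl()
    except (ValueError, OSError):
        # ValueError: malformed URL (for example, no scheme)
        # OSError: covers urllib.error.URLError, HTTPError and socket timeouts
        return url
```

With this helper, the inner loop reduces to printing resolve_url(url) for each extracted URL.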