I am using the urllib2 module in Python to fetch information from the anchor tags of pages such as http://www.google.co.in/. Below is the code:
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup
url = "http://www.google.co.in/"
page = urllib2.urlopen(url)
html = page.read()
page.close()
soup = BeautifulSoup(html)
for tag in soup.findAll('a', href=True):
    text = tag.text
    tag['href'] = urlparse.urljoin(url, tag['href'])
    print ' '.join([text, tag['href']])
result:
Web History http://www.google.co.in/history/optout?hl=en
Settings http://www.google.co.in/preferences?hl=en
Sign in https://accounts.google.com/ServiceLogin?hl=en&continue=http://www.google.co.in/
Advanced search http://www.google.co.in/advanced_search?hl=en-IN&authuser=0
Language tools http://www.google.co.in/language_tools?hl=en-IN&authuser=0
.......................
This works, but I want to store the information as a list of tuples, like this:
[('Web History','http://www.google.co.in/history/optout?hl=en'),('Settings','http://www.google.co.in/preferences?hl=en'),('Sign in','https://accounts.google.com/ServiceLogin?hl=en&continue=http://www.google.co.in/')................]
Can anyone tell me how to build a list of tuples like the one above from the data produced in the for loop?
Try something like this:
[(tag.text, urlparse.urljoin(url, tag['href']))
 for tag in soup.findAll('a', href=True)]
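The same idea can be sketched without a network call or third-party packages. This is only a sketch, assuming Python 3 and the standard library (html.parser and urllib.parse standing in for BeautifulSoup and urlparse). The LinkCollector class, the extract_links helper, and the sample HTML are hypothetical names made up for illustration:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (text, absolute_url) tuples for every <a> tag that has an href."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (text, url) tuples
        self._href = None    # absolute href of the anchor currently open, if any
        self._text = []      # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html, base_url):
    """Return a list of (link text, absolute URL) tuples found in html."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup; the real page content is not reproduced here.
sample_html = (
    '<a href="/preferences?hl=en">Settings</a>'
    '<a name="top">no href, skipped</a>'
    '<a href="https://accounts.google.com/ServiceLogin?hl=en">Sign in</a>'
)
links = extract_links(sample_html, "http://www.google.co.in/")
print(links)
```

Anchors without an href are skipped, which mirrors the href=True filter in the BeautifulSoup version.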
You can try building a dict and taking its items() as a list of tuples. This is just a hack:
def __init__(self, *args, **kwargs):
    super(IndicatorForm, self).__init__(*args, **kwargs)
    d = dir(indicators)
    b = {}
    for a in d:
        b[a] = a
    b = b.items()
    b.sort()
    self.fields["choice"].choices = b
Here dir(indicators) returns a list of attribute names (strings).
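For comparison, the same sorted list of (name, name) tuples can be built without the intermediate dict. This is a minimal sketch, assuming Python 3; the math module is only a stand-in for the indicators module, which is not shown in the post:

```python
import math  # stand-in for the `indicators` module (assumption)

# Pair every attribute name with itself, then sort the resulting tuples.
choices = sorted((name, name) for name in dir(math))
print(choices[:3])
```

In Python 3, dict.items() returns a view with no sort() method, so sorted(...) is the portable spelling.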