I have this link:
http://www.brothersoft.com/windows/categories.html
I am trying to get the link for the item inside the div, for example:
http://www.brothersoft.com/windows/mp3_audio/midi_tools/
I have tried this code:
import urllib
from bs4 import BeautifulSoup

url = 'http://www.brothersoft.com/windows/categories.html'
pageHtml = urllib.urlopen(url).read()
soup = BeautifulSoup(pageHtml)
sAll = [div.find('a') for div in soup.findAll('div', attrs={'class':'brLeft'})]
for i in sAll:
    print "http://www.brothersoft.com"+i['href']
But I only get this output:
http://www.brothersoft.com/windows/mp3_audio/
How can I get the output I need?
The URL http://www.brothersoft.com/windows/mp3_audio/midi_tools/ is not inside a <div class='brLeft'> tag, so the output http://www.brothersoft.com/windows/mp3_audio/ is correct for your code.
To get the URL you are after, change
sAll = [div.find('a') for div in soup.findAll('div', attrs={'class':'brLeft'})]
to
sAll = [div.find('a') for div in soup.findAll('div', attrs={'class':'brRight'})]
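To see why the class name matters, here is a minimal sketch run against a made-up HTML fragment. The markup is an assumption about how the categories page is laid out (a brLeft div holding the category link, a brRight div holding the subcategory links), not a copy of the real page, and the 'audio_players' entry is invented for illustration. It is written for Python 3, unlike the Python 2 snippets in this thread, and links_in is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the real page's markup may differ.
SAMPLE_HTML = """
<div class="brLeft"><a href="/windows/mp3_audio/">MP3 &amp; Audio</a></div>
<div class="brRight">
  <a href="/windows/mp3_audio/midi_tools/">MIDI Tools</a>
  <a href="/windows/mp3_audio/audio_players/">Audio Players</a>
</div>
"""

BASE = "http://www.brothersoft.com"


def links_in(html, css_class):
    """Return absolute URLs for every <a href> inside divs of the given class."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for div in soup.find_all("div", attrs={"class": css_class}):
        # find_all collects every link in the div; find would stop at the first.
        for a in div.find_all("a", href=True):
            urls.append(BASE + a["href"])
    return urls


left_links = links_in(SAMPLE_HTML, "brLeft")
right_links = links_in(SAMPLE_HTML, "brRight")
print(left_links)
print(right_links)
```

Note that div.find('a') returns only the first link in each div. If a brRight div holds several subcategory links, find_all('a') as used here picks up all of them.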
UPDATE:
An example that gets the information inside each subcategory page, such as 'midi_tools':
import urllib
from bs4 import BeautifulSoup

url = 'http://www.brothersoft.com/windows/categories.html'
pageHtml = urllib.urlopen(url).read()
soup = BeautifulSoup(pageHtml)
sAll = [div.find('a') for div in soup.findAll('div', attrs={'class':'brRight'})]
for i in sAll:
    suburl = "http://www.brothersoft.com"+i['href'] #which is a url like 'midi_tools'
    content = urllib.urlopen(suburl).read()
    anosoup = BeautifulSoup(content)
    ablock = anosoup.find('table',{'id':'courseTab'})
    for atr in ablock.findAll('tr',{'class':'border_bot '}):
        print atr.find('dt').a.string #name
        print "http://www.brothersoft.com" + atr.find('a',{'class':'tabDownload'})['href'] #link
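Gluing the host name onto each href works while every href is root-relative, as in the output shown in this thread. As a hedged alternative, the standard library's urljoin resolves an href against the page URL and also copes with hrefs that are already absolute or relative to the current page. This is a Python 3 sketch (in Python 2 the same function lives in the urlparse module); to_absolute is a hypothetical helper name and the page-relative href is an invented input.

```python
from urllib.parse import urljoin

# The page the hrefs were scraped from.
PAGE_URL = "http://www.brothersoft.com/windows/categories.html"


def to_absolute(href, base=PAGE_URL):
    """Resolve an href found on the page against the page's own URL."""
    return urljoin(base, href)


# Root-relative href, as seen in the thread's output.
root_relative = to_absolute("/windows/mp3_audio/midi_tools/")
# Already-absolute href: returned unchanged.
already_absolute = to_absolute("http://www.brothersoft.com/windows/mp3_audio/")
# Page-relative href (invented input): resolved against the /windows/ directory.
page_relative = to_absolute("mp3_audio/")

print(root_relative)
print(already_absolute)
print(page_relative)
```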