
Scraping data from website

I'm having trouble linking the pages together. I need spider code that follows the links on each page and grabs the required details. So far my code can grab the required information from one page, but there are other pages whose information I need too. The base_url page contains the applications' info: I want to collect all the links from that page, then move to the next page and repeat the same thing. After that I need to collect each application's details, such as name and version number, from the links I have collected.
Right now I can collect all the information, but the pages are not linked together. How can I do that? Here is my code:

# extracting links
# (mech, extract and extractinside are defined elsewhere in the script)
def linkextract(soup):
    print "\n extracting links of next pages"
    print "\n\n page 2 \n"
    sAll = [div.find('a') for div in soup.findAll('div', attrs={'class': ''})]
    for i in sAll:
        suburl = "" + i['href']  # checking pages
        print suburl
        # open each linked page and extract its details
        pages = mech.open(suburl)
        content = pages.read()
        anosoup = BeautifulSoup(content)
        extract(anosoup)
    app_url = ""
    print app_url
    #print soup.prettify()
    page1 = mech.open(app_url)
    html1 = page1.read()
    soup1 = BeautifulSoup(html1)
    print "\n\n application page details \n"
    extractinside(soup1)

Any assistance would be appreciated. Thank you.

Here's what you should start with:

import urllib2
from bs4 import BeautifulSoup

URL = 'http://www.pcwelt.de/download-neuzugaenge.html'

soup = BeautifulSoup(urllib2.urlopen(URL))
links = [tr.td.a['href'] for tr in soup.find('div', {'class': 'boxed'}).table.find_all('tr') if tr.td]

for link in links:
    url = "http://www.pcwelt.de{0}".format(link)
    soup = BeautifulSoup(urllib2.urlopen(url))

    name = soup.find('span', {'itemprop': 'name'}).text
    version = soup.find('td', {'itemprop': 'softwareVersion'}).text
    print "Name: %s; Version: %s" % (name, version)

prints:

Name: Ashampoo Clip Finder HD Free; Version: 2.3.6
Name: Many Cam; Version: 4.0.63
Name: Roboform; Version: 7.9.5.7
...
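
The snippet above covers a single listing page. To also walk the following pages, one option is a loop that collects the application links on a page, then follows that page's "next" link until there is none. Here is a minimal Python 3 sketch of that idea. It uses the standard library's HTMLParser instead of BeautifulSoup so it runs without extra installs. The `LinkCollector` and `crawl` names, the `fetch` parameter, the example.com pages, and the assumption that the pager marks its forward link with rel="next" are all hypothetical, so adapt the selector to whatever the real site uses.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect <a href> values, separating the 'next page' link from the rest."""

    def __init__(self):
        super().__init__()
        self.app_links = []
        self.next_link = None

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        # Assumption: the pager marks its forward link with rel="next".
        if attrs.get("rel") == "next":
            self.next_link = href
        else:
            self.app_links.append(href)


def crawl(start_url, fetch, max_pages=10):
    """Follow 'next' links from start_url; return the application URLs found.

    fetch is any callable that takes a URL and returns the page HTML as str.
    """
    seen_pages = set()
    app_urls = []
    url = start_url
    while url and url not in seen_pages and len(seen_pages) < max_pages:
        seen_pages.add(url)
        parser = LinkCollector()
        parser.feed(fetch(url))
        for href in parser.app_links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in app_urls:
                app_urls.append(absolute)
        url = urljoin(url, parser.next_link) if parser.next_link else None
    return app_urls


# Hypothetical two-page listing used in place of real network access.
PAGES = {
    "http://example.com/list":
        '<a href="/app/1">A</a><a href="/app/2">B</a>'
        '<a rel="next" href="/list?page=2">next</a>',
    "http://example.com/list?page=2": '<a href="/app/3">C</a>',
}

print(crawl("http://example.com/list", PAGES.get))
# ['http://example.com/app/1', 'http://example.com/app/2', 'http://example.com/app/3']
```

Against a live site, `fetch` could be something like `lambda u: urllib.request.urlopen(u).read().decode("utf-8")`, and each returned URL can then be opened to read the name and version as shown above. The `seen_pages` set and `max_pages` cap are there so a pager that loops back on itself cannot keep the crawl running forever.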

Hope that helps.


 