
Beautiful Soup: not grabbing correct information

I am using Beautiful Soup to scrape the bold flower names and their corresponding picture links from this page: http://www.all-my-favourite-flower-names.com/list-of-flower-names.html

I want the scraper to work not just for the flowers beginning with "A", but for all of the other flowers as well (those starting with "B", "C", "D", etc.).

I was able to hack together something for some of the "A" flowers...

# `soup` is assumed to be the BeautifulSoup object for the downloaded page
flowers = []
links = []

for flower in soup.find_all('b'):  # Finds flower names and appends them to the flowers list
    flower = flower.string
    if flower is not None and flower.startswith("A"):
        flowers.append(flower.strip('.()'))

for link in soup.find_all('img'):  # Finds 'src' in <img> tag and appends 'src' to the links list
    src = link['src']
    # str.strip('https://') would remove a set of characters from both ends, so cut the prefix instead
    if src.startswith('https://'):
        src = src[len('https://'):]
    links.append(src)

for straggler in soup.find_all('a'):  # Finds the only flower name that doesn't follow the pattern of the other names and inserts it into flowers list
    floss = straggler.string
    if floss == "Ageratum houstonianum.":
        flowers.insert(3, floss)

The obvious problem with this is that it will most definitely break when anything changes. Could someone please give me a hand?
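When trimming the scheme from an image URL, note that `str.strip('https://')` treats its argument as a set of characters and removes them from both ends, so it can also eat the tail of the link. A minimal sketch of a safer approach follows; `remove_scheme` is a hypothetical helper name and the `example.com` URL is made up for illustration:

```python
def remove_scheme(url):
    """Drop a leading 'https://' or 'http://' and leave the rest untouched."""
    for prefix in ("https://", "http://"):
        if url.startswith(prefix):
            return url[len(prefix):]
    return url


# Illustrative URL only; it is not taken from the flower site.
sample = "https://example.com/pics/tulips"

# strip() removes any of the characters h, t, p, s, : and / from both ends,
# so the trailing "ps" of "tulips" is lost as well.
stripped = sample.strip("https://")

# Slicing off the exact prefix keeps the end of the URL intact.
prefix_removed = remove_scheme(sample)

print(stripped)
print(prefix_removed)
```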

The problem seems to be that the flowers are split across several pages. Something like this should help you loop through the different pages (code not tested):

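from urllib.request import urlopen  # Python 3; on Python 2 use urllib2.urlopen instead

from bs4 import BeautifulSoup

# Maps each page label to the suffix used in that page's URL
test = {'A': '', 'B': '-B', 'XYZ': '-X-Y-Z'}
flower_list = []
for key, value in test.items():
    url = 'http://www.all-my-favourite-flower-names.com/list-of-flower-names{0}.html'.format(value)
    page = urlopen(url).read()
    soup = BeautifulSoup(page, 'html.parser')
    # Now do your logic for every page, and probably save the flower names in flower_list.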
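The per-page logic could be kept independent of any one starting letter. The sketch below is only an assumption about how that might look: `page_url` and `clean_names` are hypothetical helpers, the suffix dict lists only the three pages named above, and the rule "a flower name starts with an uppercase letter" is a guess that should be checked against the real markup.

```python
BASE = "http://www.all-my-favourite-flower-names.com/list-of-flower-names{0}.html"

# Only the suffixes given in the answer; the remaining pages are not listed
# there, so extend this dict after checking the site's actual links.
PAGE_SUFFIXES = {"A": "", "B": "-B", "XYZ": "-X-Y-Z"}


def page_url(suffix):
    """Build the URL of one listing page from its suffix."""
    return BASE.format(suffix)


def clean_names(raw_strings):
    """Keep strings that start with an uppercase letter, trimmed of
    surrounding whitespace, dots and parentheses."""
    names = []
    for raw in raw_strings:
        if raw is None:  # tag.string is None when a tag has more than one child
            continue
        name = raw.strip().strip(".()").strip()
        if name and name[0].isupper():  # assumption: names are capitalised
            names.append(name)
    return names
```

Inside the loop, something like `flower_list.extend(clean_names(b.string for b in soup.find_all('b')))` would then collect the names from every page rather than only those starting with "A".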

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address.

 