Beautiful Soup: not grabbing correct information

Question

I am using Beautiful Soup to scrape the bold flower names and their corresponding picture links from this page: http://www.all-my-favourite-flower-names.com/list-of-flower-names.html

I want the scraper to work not just for the flowers beginning with "A", but for all of the other flowers as well (those starting with "B", "C", "D", etc.).

I was able to hack together something for some of the "A" flowers...

for flower in soup.find_all('b'):  # Finds flower names and appends them to the flowers list
    flower = flower.string
    if flower is not None and flower.startswith("A"):
        flowers.append(flower.strip('.()'))

for link in soup.find_all('img'):  # Finds 'src' in each <img> tag and appends it to the links list
    # split() drops the URL scheme; strip('https://') would remove a set of characters, not a prefix
    links.append(link['src'].split('://', 1)[-1])

for stragler in soup.find_all('a'):  # Finds the only flower name that doesn't follow the pattern of the other names and inserts it into the flowers list
    floss = stragler.string
    if floss == "Ageratum houstonianum.":
        flowers.insert(3, floss)

The obvious problem with this is that it will most definitely break when anything changes. Could someone please give me a hand?

Answer 1

The problem seems to be that the flowers are split across several pages. Something like this should help you loop through the different pages. The code is not tested.

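import urllib2  # Python 2; on Python 3 use urllib.request.urlopen instead
from bs4 import BeautifulSoup

# Maps each page's letter group to the suffix used in its URL.
test = {'A': '', 'B': '-B', 'XYZ': '-X-Y-Z'}
flower_list = []
for key, value in test.items():
    page = urllib2.urlopen('http://www.all-my-favourite-flower-names.com/list-of-flower-names{0}.html'.format(value)).read()
    soup = BeautifulSoup(page, 'html.parser')
    # Now do your logic for every page, and probably save the flower names in flower_list.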
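
If you are on Python 3, a rough equivalent of the loop above might look like the sketch below. The names BASE_URL, PAGE_SUFFIXES, page_url and fetch_page are made up for illustration, and only the three suffixes mentioned above are listed, since I have not checked what the other pages are called. Building the URL is kept separate from downloading so the URL part can be checked without touching the network.

```python
from urllib.request import urlopen

# URL pattern taken from the answer above; {0} is the per-page suffix.
BASE_URL = 'http://www.all-my-favourite-flower-names.com/list-of-flower-names{0}.html'

# Only the suffixes given in the answer; the remaining pages are not known here.
PAGE_SUFFIXES = {'A': '', 'B': '-B', 'XYZ': '-X-Y-Z'}


def page_url(suffix):
    """Build the URL of one listing page from its suffix."""
    return BASE_URL.format(suffix)


def fetch_page(suffix):
    """Download one listing page and return its raw bytes (needs network access)."""
    with urlopen(page_url(suffix)) as response:
        return response.read()


if __name__ == '__main__':
    # Print the URLs only; call fetch_page(suffix) to actually download a page.
    for letters, suffix in PAGE_SUFFIXES.items():
        print(letters, page_url(suffix))
```

The bytes returned by fetch_page can be passed straight to BeautifulSoup, the same way page is used in the loop above.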

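For the per-page logic, one way to avoid hard-coding the first letter is to take the text of every bold tag, whatever it starts with. This is only a sketch: it assumes every flower name sits inside a <b> tag, which the question suggests but which I have not verified against the live page, and flower_names and image_links are hypothetical helper names. It uses get_text() rather than .string, because .string is None when a tag has more than one child.

```python
from bs4 import BeautifulSoup


def flower_names(html):
    """Return the cleaned text of every <b> tag that contains text."""
    soup = BeautifulSoup(html, 'html.parser')
    names = []
    for tag in soup.find_all('b'):
        # get_text() also collects text from nested tags such as <a>.
        text = tag.get_text().strip().strip('.()')
        if text:
            names.append(text)
    return names


def image_links(html):
    """Return the 'src' attribute of every <img> tag that has one."""
    soup = BeautifulSoup(html, 'html.parser')
    return [img['src'] for img in soup.find_all('img') if img.get('src')]
```

Calling flower_names(page) and image_links(page) once per page and extending two lists with the results would cover every letter without a special case for individual names, provided the assumption about the <b> tags holds.
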
solution1
1 ACCEPTED 2015-12-11 01:28:12
