I am using Beautiful Soup to scrape the bold flower names and their corresponding picture links from: http://www.all-my-favourite-flower-names.com/list-of-flower-names.html
I want the scraper to work not just for the flowers beginning with "A", but for all of the other flowers as well (flowers starting with "B", "C", "D", etc.).
I was able to hack together something for some of the "A" flowers...
for flower in soup.find_all('b'):  # Finds flower names and appends them to the flowers list
    flower = flower.string
    if flower is not None and flower[0] == "A":
        flowers.append(flower.strip('.()'))

for link in soup.find_all('img'):  # Finds 'src' in each <img> tag and appends it to the links list
    links.append(link['src'].strip('https://'))

for straggler in soup.find_all('a'):  # Finds the one flower name that doesn't follow the pattern of the other names and inserts it into the flowers list
    floss = straggler.string
    if floss is not None and floss == "Ageratum houstonianum.":
        flowers.insert(3, floss)
The obvious problem with this is that it will almost certainly break when anything on the page changes. Could someone please give me a hand?
The problem seems to be that the flowers are paginated across several pages. Something like this should let you loop through the different pages (code not tested):
import urllib.request  # urllib2 in Python 2
from bs4 import BeautifulSoup

# Maps each letter group to the suffix used in that page's URL
pages = {'A': '', 'B': '-B', 'XYZ': '-X-Y-Z'}
flower_list = []
for key, value in pages.items():
    url = 'http://www.all-my-favourite-flower-names.com/list-of-flower-names{0}.html'.format(value)
    page = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(page, 'html.parser')
    # Now do your logic for every page, and probably save the flower names in a list.
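For the per-page logic, a more change-resistant approach than keeping two parallel lists is to pair each bold name directly with the nearest image that follows it, using `find_next`. Here is a minimal sketch of that idea against a small inline HTML snippet; the snippet's structure (each flower name in a `<b>` tag followed by its `<img>`) and the example URLs are assumptions about the real page, so verify them against the actual markup first.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML mirroring the assumed page structure:
# each bold flower name is followed by its picture's <img> tag.
html = """
<p><b>Abutilon.</b> <img src="https://example.com/abutilon.jpg"></p>
<p><b>Acacia.</b> <img src="https://example.com/acacia.jpg"></p>
"""

soup = BeautifulSoup(html, 'html.parser')

pairs = {}
for bold in soup.find_all('b'):
    name = bold.string
    if name is None:
        continue
    name = name.strip('.()')
    img = bold.find_next('img')  # the nearest <img> after this name
    if img is not None and img.has_attr('src'):
        pairs[name] = img['src']

print(pairs)
```

Because each name is tied to the image that follows it in document order, an extra image or a missing name on the page skews only that one entry rather than shifting every pair in two separate lists out of alignment.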