Beautiful Soup: not grabbing correct information

I am using Beautiful Soup to scrape the bold flower names and their corresponding picture links from this page: http://www.all-my-favourite-flower-names.com/list-of-flower-names.html

I want this to work not just for the flowers beginning with "A", but for all of the other flowers as well (those starting with "B", "C", "D", etc.).

I was able to hack together something for some of the "A" flowers:

# Assumes `soup` is the parsed page and `flowers` and `links` are existing lists.
for flower in soup.find_all('b'):  # Find flower names and append them to the flowers list
    flower = flower.string
    if flower is not None and flower[0] == "A":
        flowers.append(flower.strip('.()'))

for link in soup.find_all('img'):  # Find each <img> tag and append its 'src', minus the URL scheme, to the links list
    # str.strip('https://') removes a set of characters, not a prefix, so split on '://' instead
    links.append(link['src'].split('://', 1)[-1])

for straggler in soup.find_all('a'):  # Find the one flower name that doesn't follow the pattern of the others and insert it into the flowers list
    floss = straggler.string
    if floss == "Ageratum houstonianum.":
        flowers.insert(3, floss)

The obvious problem with this is that it will almost certainly break as soon as anything on the page changes. Could someone please give me a hand?

The problem seems to be that the flowers are paginated across several pages. Something like this should help you loop through the different pages. The code is not tested.

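from urllib.request import urlopen  # Python 3
from bs4 import BeautifulSoup

# Maps each page label to the suffix used in that page's URL.
page_suffixes = {'A': '', 'B': '-B', 'XYZ': '-X-Y-Z'}
flower_list = []
for label, suffix in page_suffixes.items():
    url = 'http://www.all-my-favourite-flower-names.com/list-of-flower-names{0}.html'.format(suffix)
    page = urlopen(url).read()
    soup = BeautifulSoup(page, 'html.parser')
    # Now do your logic for every page, and probably save the flower names in flower_list.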
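
The per-page logic left open above could look something like the sketch below. It is only a guess at the page structure: it assumes every flower name sits in a <b> tag and every picture in an <img> tag with a src attribute, which I have not checked against the real site. The helper names extract_flower_names and extract_image_links are made up for illustration. Using get_text() instead of .string means a name that is partly wrapped in a nested <a> tag is still picked up, so no special case should be needed for the odd one out.

```python
from bs4 import BeautifulSoup


def extract_flower_names(html):
    """Return the cleaned text of every non-empty <b> tag (assumed to hold a flower name)."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for bold in soup.find_all("b"):
        # get_text() joins text from nested tags such as <a>,
        # where .string would return None for mixed content.
        text = bold.get_text(" ", strip=True).strip(".() ")
        if text:
            names.append(text)
    return names


def extract_image_links(html):
    """Return the src attribute of every <img> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    return [img["src"] for img in soup.find_all("img") if img.get("src")]
```

Inside the loop you would call both helpers on page and extend flower_list with the results. Because nothing here filters on the first letter, the same code should work for every page, as long as the assumption about the <b> tags holds.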
