Python Webscrape with Beautiful Soup

I am new to Python and working on a web scraper. My issue is that my list is only populated with a single link from each category. The length of the output is 9, but it should be 25. I am pretty sure my error has something to do with my l=[] and d={}, but I am not sure.

Any help would be appreciated.

import requests
from bs4 import BeautifulSoup
import gspread
import re
#import pandas as pd

url = 'https://www.astro.org/Patient-Care-and-Research/Clinical-Practice-Statements/Clinical-Practice-Guidelines'

r=requests.get(url)
c=r.content

soup=BeautifulSoup(c,'lxml')

all=soup.find_all('div', {'class':'panel-body'})

l=[]
for item in all:
      
    try:
        links=item.find_all('a')
        for a in links:
            d={}
            d['link']=zurl= ("https://www.astro.org" + a['href'])
            r2=requests.get(zurl)
            c2=r2.content
            soup2=BeautifulSoup(c2,'html.parser')
            title=soup2.select('#form > div.wrapper.interior-page > section:nth-child(6) > div > div > div.col-md-8.col-md-offset-1.col-sm-8.col-sm-offset-1.col-xs-12.floatright > div:nth-child(1) > div > h1')
            titlelst = title[:len(title)]
            titleparagraph = []
            for x in titlelst:
                titleparagraph.append(str(x.text))
                d['title']=("".join(map(str,titleparagraph)))
            all3=soup2.select('#form > div.wrapper.interior-page > section:nth-child(6) > div > div > div.col-md-8.col-md-offset-1.col-sm-8.col-sm-offset-1.col-xs-12.floatright > div:nth-child(2) > div')
            lst = all3[:len(all3)]
            paragraphs = []
            for x in lst:
                paragraphs.append(str(x.text))
                d['full']=("".join(map(str,paragraphs)))
                lplinks=x.find_all('a')
                lplinklist = []
                for a in lplinks:
                    lplinklist.append(str(a['href'])+'\n')
                    d['link2']=("".join(map(str,lplinklist)))     
                    
    except:
        print(None)
     
    l.append(d)
    print(len(l))

You put l.append(d) outside the inner for loop, so only the last d built for each category is appended. Move it to the end of the inner loop body and it will work fine:

for item in all:

    try:
        links = item.find_all('a')
        for a in links:
            ... 
            ...    

            l.append(d)

    except:
        print(None)

print(len(l)) # prints 25
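
To see the effect without hitting the site, here is a minimal sketch that mimics the two loop shapes. The `categories` dictionary, its paths, and the function names `collect_buggy` and `collect_fixed` are hypothetical stand-ins for the scraped page, not part of the original code.

```python
# Hypothetical stand-in for the scraped page: each category maps to its links.
categories = {
    "category-a": ["/a1", "/a2", "/a3"],
    "category-b": ["/b1"],
    "category-c": ["/c1", "/c2"],
}


def collect_buggy(categories):
    """Append once per category, after the inner loop has finished."""
    results = []
    for links in categories.values():
        for href in links:
            d = {"link": "https://www.astro.org" + href}
        # Runs once per category, so only the last d of each category is kept.
        results.append(d)
    return results


def collect_fixed(categories):
    """Append once per link, inside the inner loop."""
    results = []
    for links in categories.values():
        for href in links:
            d = {"link": "https://www.astro.org" + href}
            results.append(d)
    return results


print(len(collect_buggy(categories)))  # 3, one entry per category
print(len(collect_fixed(categories)))  # 6, one entry per link
```

With the append in the outer loop, the result length equals the number of categories, which matches the 9 seen in the question. Note also that the bare except in the original hides any error raised while fetching or parsing a page; catching a specific exception and printing it would probably make such failures easier to spot.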
