Python Webscrape with Beautiful Soup
I am new to Python and working on a web scraper. My issue is that my list is only populating the first link in each category. The length of the output is 9, but it should be 25. I am pretty sure my error has something to do with my l=[] and d={}, but I'm not sure.
Any help would be appreciated.
import requests
from bs4 import BeautifulSoup
import gspread
import re
#import pandas as pd

url = 'https://www.astro.org/Patient-Care-and-Research/Clinical-Practice-Statements/Clinical-Practice-Guidelines'
r = requests.get(url)
c = r.content
soup = BeautifulSoup(c, 'lxml')
all = soup.find_all('div', {'class': 'panel-body'})
l = []
for item in all:
    try:
        links = item.find_all('a')
        for a in links:
            d = {}
            d['link'] = zurl = ("https://www.astro.org" + a['href'])
            r2 = requests.get(zurl)
            c2 = r2.content
            soup2 = BeautifulSoup(c2, 'html.parser')
            title = soup2.select('#form > div.wrapper.interior-page > section:nth-child(6) > div > div > div.col-md-8.col-md-offset-1.col-sm-8.col-sm-offset-1.col-xs-12.floatright > div:nth-child(1) > div > h1')
            titlelst = title[:len(title)]
            titleparagraph = []
            for x in titlelst:
                titleparagraph.append(str(x.text))
            d['title'] = ("".join(map(str, titleparagraph)))
            all3 = soup2.select('#form > div.wrapper.interior-page > section:nth-child(6) > div > div > div.col-md-8.col-md-offset-1.col-sm-8.col-sm-offset-1.col-xs-12.floatright > div:nth-child(2) > div')
            lst = all3[:len(all3)]
            paragraphs = []
            for x in lst:
                paragraphs.append(str(x.text))
            d['full'] = ("".join(map(str, paragraphs)))
            lplinks = x.find_all('a')
            lplinklist = []
            for a in lplinks:
                lplinklist.append(str(a['href']) + '\n')
            d['link2'] = ("".join(map(str, lplinklist)))
    except:
        print(None)
    l.append(d)
print(len(l))
You just put the l.append(d) outside of the inner for loop, so you are only appending the last d built for each item you query. Move it to the end of the inner loop and it will work fine:
for item in all:
    try:
        links = item.find_all('a')
        for a in links:
            ...
            ...
            l.append(d)
    except:
        print(None)

print(len(l))  # prints 25
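The effect of the append placement can be seen without any network calls. The sketch below uses made-up category/link data (not the astro.org page) purely to show why the buggy version yields one dict per category while the fixed version yields one dict per link:

```python
# Stand-in data: two "categories" with three and two "links" respectively.
categories = [["a1", "a2", "a3"], ["b1", "b2"]]

# Buggy placement: append after the inner loop finishes, so only the
# last d built inside the inner loop survives for each category.
buggy = []
for links in categories:
    for link in links:
        d = {"link": link}
    buggy.append(d)

# Fixed placement: append inside the inner loop, one dict per link.
fixed = []
for links in categories:
    for link in links:
        d = {"link": link}
        fixed.append(d)

print(len(buggy))  # 2 -- one dict per category, each holding the last link
print(len(fixed))  # 5 -- one dict per link
```

The same logic explains the question's output: 9 panel-body divs produce 9 appends in the buggy version, versus 25 (one per link) once the append moves inside the inner loop.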