Simple web scraper very slow
I'm fairly new to Python and web scraping in general. The code below works, but it seems awfully slow for the amount of information it actually goes through. Is there any way to easily cut down on execution time? I'm not sure, but it does seem like I've typed out more and made this more difficult than I actually needed to; any help would be appreciated.
Currently the code starts at a sitemap index, then iterates through a list of additional sitemaps. Within each new sitemap it pulls information to construct a URL for the JSON data of a webpage. From the JSON data I pull an XML link that I use to search for a string. If the string is found, the link is appended to a text file.
import io
import requests
from bs4 import BeautifulSoup

# global variables
start = 'https://www.govinfo.gov/wssearch/getContentDetail?packageId='
dash = '-'

urlSitemap = "https://www.govinfo.gov/sitemap/PLAW_sitemap_index.xml"
old_xml = requests.get(urlSitemap)
print(old_xml)
new_xml = io.BytesIO(old_xml.content).read()
final_xml = BeautifulSoup(new_xml, "lxml")
linkToBeFound = final_xml.findAll('loc')
for loc in linkToBeFound:
    urlPLmap = loc.text
    old_xmlPLmap = requests.get(urlPLmap)
    print(old_xmlPLmap)
    new_xmlPLmap = io.BytesIO(old_xmlPLmap.content).read()
    final_xmlPLmap = BeautifulSoup(new_xmlPLmap, "lxml")
    linkToBeFound2 = final_xmlPLmap.findAll('loc')
    for pls in linkToBeFound2:
        argh = pls.text.find('PLAW')
        theWanted = pls.text[argh:]
        # .json() parses the response safely; eval() would execute it as code
        thisShallWork = requests.get(start + theWanted).json()
        print(requests.get(start + theWanted))
        dict1 = thisShallWork['download']
        finaldict = dict1['modslink'][2:]
        print(finaldict)
        url2 = 'https://' + finaldict
        try:
            old_xml4 = requests.get(url2)
            print(old_xml4)
            new_xml4 = io.BytesIO(old_xml4.content).read()
            final_xml4 = BeautifulSoup(new_xml4, "lxml")
            references = final_xml4.findAll('identifier', {'type': 'Statute citation'})
            for sec in references:
                if sec.text == "106 Stat. 4845":
                    print(dash * 20)
                    print(sec.text)
                    print(dash * 20)
                    sec313 = open('sec313info.txt', 'a')
                    sec313.write("\n")
                    sec313.write(pls.text + '\n')
                    sec313.close()
        except requests.RequestException:
            print('error at: ' + url2)
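One bug worth calling out in the original: `eval()` on the response body executes whatever the server sends, and it also chokes on JSON literals like `true` and `null`. A minimal sketch of the safer parse; the JSON body below is a made-up example shaped like the `getContentDetail` payload the code expects:

```python
import json

# Hypothetical response body with the same shape the code above reads.
body = '{"download": {"modslink": "//www.govinfo.gov/metadata/pkg/PLAW-102publ588/mods.xml"}}'

data = json.loads(body)                      # parses data only, runs no code
modslink = data['download']['modslink'][2:]  # strip the leading "//", as finaldict does
print(modslink)
```

With a live `requests` response you would call `response.json()` instead, which performs the same parse on the body.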
No idea why I spent so long on this, but I did. Your code was really hard to look through, so I started with that: I broke it up into two parts, getting the links from the sitemaps, then everything else. I broke a few bits out into separate functions too. This checks about 2 URLs per second on my machine, which seems about right. Here's how this is better (you can argue with me about this part).
import requests
from bs4 import BeautifulSoup

# returns sitemap links
def get_links(s):
    old_xml = requests.get(s)
    new_xml = old_xml.text
    final_xml = BeautifulSoup(new_xml, "lxml")
    return final_xml.findAll('loc')

# gets the final url from your middle url and looks through it for the thing you are looking for
def scrapey(link):
    link_id = link[link.find("PLAW"):]
    r = requests.get('https://www.govinfo.gov/wssearch/getContentDetail?packageId={}'.format(link_id))
    print(r.url)
    try:
        r = requests.get("https://{}".format(r.json()["download"]["modslink"][2:]))
        print(r.url)
        soup = BeautifulSoup(r.text, "lxml")
        references = soup.findAll('identifier', {'type': 'Statute citation'})
        for ref in references:
            if ref.text == "106 Stat. 4845":
                return r.url
        # only give up after checking every citation on the page
        return False
    except (requests.RequestException, ValueError, KeyError):
        print("bah " + r.url)
        return False

sitemap_links_el = get_links("https://www.govinfo.gov/sitemap/PLAW_sitemap_index.xml")
sitemap_links = map(lambda x: x.text, sitemap_links_el)
nlinks_el = map(get_links, sitemap_links)
links = [num.text for elem in nlinks_el for num in elem]

with open("output.txt", "a") as f:
    for link in links:
        url = scrapey(link)
        if url is False:
            print("no find")
        else:
            print("found on: {}".format(url))
            f.write("{}\n".format(url))
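Since each check is dominated by network round-trips, the ~2 URLs per second could likely be pushed further with a thread pool (and a `requests.Session` to reuse connections). A sketch of the pattern, where `check_link` is a hypothetical stand-in for `scrapey` so the example runs without the network:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for scrapey(): returns the link on a hit, False otherwise.
def check_link(link):
    return link if "match" in link else False

links = ["PLAW-1-match", "PLAW-2", "PLAW-3-match"]

# Threads overlap the waiting on I/O; results come back in input order.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(check_link, links))

found = [r for r in results if r is not False]
print(found)
```

Be polite about `max_workers` against a public site; a handful of threads is usually enough to hide the latency without hammering the server.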