

simple web scraper very slow

I'm fairly new to Python and web scraping in general. The code below works, but it seems awfully slow for the amount of information it's actually going through. Is there any way to easily cut down on execution time? I'm not sure, but it does seem like I have typed out more / made it more difficult than I actually needed to; any help would be appreciated.

Currently the code starts at the sitemap index, then iterates through a list of additional sitemaps. Within the new sitemaps it pulls data to construct a URL for the JSON data of a webpage. From the JSON data I pull an XML link that I use to search for a string. If the string is found, the page link is appended to a text file.

import io
import requests
from bs4 import BeautifulSoup

# global variables
start = 'https://www.govinfo.gov/wssearch/getContentDetail?packageId='
dash = '-'
urlSitemap = "https://www.govinfo.gov/sitemap/PLAW_sitemap_index.xml"

old_xml=requests.get(urlSitemap)
print (old_xml)
new_xml= io.BytesIO(old_xml.content).read()
final_xml=BeautifulSoup(new_xml)
linkToBeFound = final_xml.findAll('loc')
for loc in linkToBeFound:
    urlPLmap=loc.text
    old_xmlPLmap=requests.get(urlPLmap)
    print(old_xmlPLmap)
    new_xmlPLmap= io.BytesIO(old_xmlPLmap.content).read()
    final_xmlPLmap=BeautifulSoup(new_xmlPLmap)
    linkToBeFound2 = final_xmlPLmap.findAll('loc')
    for pls in linkToBeFound2:
        argh = pls.text.find('PLAW')
        theWanted = pls.text[argh:]
        thisShallWork =eval(requests.get(start + theWanted).text)
        print(requests.get(start + theWanted))
        dict1 = (thisShallWork['download'])
        finaldict = (dict1['modslink'])[2:]
        print(finaldict)
        url2='https://' + finaldict
        try:    
            old_xml4=requests.get(url2)
            print(old_xml4)
            new_xml4= io.BytesIO(old_xml4.content).read()
            final_xml4=BeautifulSoup(new_xml4)
            references = final_xml4.findAll('identifier',{'type': 'Statute citation'})
            for sec in references: 
                if sec.text == "106 Stat. 4845":
                    print(dash * 20)
                    print(sec.text)
                    print(dash * 20)
                    sec313 = open('sec313info.txt','a')
                    sec313.write("\n")
                    sec313.write(pls.text + '\n')
                    sec313.close()
        except:
            print('error at: ' + url2)

No idea why I spent so long on this, but I did. Your code was really hard to look through, so I started with that: I broke it up into two parts, getting the links from the sitemaps, then everything else. I broke out a few bits into separate functions too. This checks about 2 URLs per second on my machine, which seems about right. Here's why this is better (you can argue with me about this part):

  • No need to reopen and close the output file after each write
  • Removed a fair bit of unneeded code
  • Gave your variables better names (this does not improve speed in any way, but please do this, especially if you are asking for help with it)
  • Really the main thing... once you break it all up, it becomes fairly clear that what's slowing you down is waiting on the requests, which is pretty standard for web scraping; you can look into multithreading to avoid the wait (see the sketch after the code below). Once you get into multithreading, the benefit of breaking up your code will likely become much more evident.
import requests
from bs4 import BeautifulSoup

# returns sitemap links
def get_links(s):
    old_xml = requests.get(s)
    new_xml = old_xml.text
    final_xml = BeautifulSoup(new_xml, "lxml")
    return final_xml.findAll('loc')

# gets the final url from your middle url and looks through it for the thing you are looking for
def scrapey(link):
    link_id = link[link.find("PLAW"):]
    r = requests.get('https://www.govinfo.gov/wssearch/getContentDetail?packageId={}'.format(link_id))
    print(r.url)
    try:
        r = requests.get("https://{}".format(r.json()["download"]["modslink"][2:]))
        print(r.url)
        soup = BeautifulSoup(r.text, "lxml")
        references = soup.findAll('identifier', {'type': 'Statute citation'})
        for ref in references:
            if ref.text == "106 Stat. 4845":
                return r.url
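        # note: this "else" belongs to the for loop; it only runs when the loop
        # finishes without finding the citation (a match returns from the function)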
        else:
            return False
    except:
        print("bah" + r.url)
        return False


sitemap_links_el = get_links("https://www.govinfo.gov/sitemap/PLAW_sitemap_index.xml")
sitemap_links = map(lambda x: x.text, sitemap_links_el)
nlinks_el = map(get_links, sitemap_links)
links = [num.text for elem in nlinks_el for num in elem]



with open("output.txt", "a") as f:
    for link in links:
        url = scrapey(link)
        if url is False:
            print("no find")
        else:
            print("found on: {}".format(url))
            f.write("{}\n".format(url))
