
Web scraping a large number of links?

I am very new to web scraping. I have started using BeautifulSoup in Python. I wrote code that loops through a list of URLs and gets me the data I need. The code works fine for 10-12 links, but I am not sure whether the same code will hold up if the list has over 100 links. Is there an alternative way, or another library, to get the data from a large list of URLs without harming the website in any way? Here is my code so far.

from requests import get          # assumed import, since the loop calls get() and reads res.text
from bs4 import BeautifulSoup

url_list = [url1, url2, url3, url4, url5]
mylist = []
for url in url_list:
    res = get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    data = soup.find('pre').text
    mylist.append(data)
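
The same pattern scales to 100+ links, but with that many requests it is worth reusing one HTTP connection, pausing between requests so the site is not hit too quickly, and skipping links that fail instead of letting one bad URL crash the loop. Below is a minimal sketch along those lines using requests and BeautifulSoup; the one-second delay, the 10-second timeout, and the placeholder URLs are assumptions you can adjust.

import time
import requests
from bs4 import BeautifulSoup

url_list = [url1, url2, url3, url4, url5]  # placeholders for your real URLs
mylist = []

session = requests.Session()  # reuse one connection instead of opening a new one per request
for url in url_list:
    try:
        res = session.get(url, timeout=10)
        res.raise_for_status()                # treat 4xx/5xx responses as failures
    except requests.RequestException as exc:
        print(f'Skipping {url}: {exc}')       # log and move on rather than aborting the whole run
        continue
    soup = BeautifulSoup(res.text, 'html.parser')
    pre = soup.find('pre')
    if pre is not None:                       # some pages may not contain a <pre> tag
        mylist.append(pre.text)
    time.sleep(1)                             # assumed polite delay between requests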

Here's an example that might work for you.

from simplified_scrapy import Spider, SimplifiedDoc, SimplifiedMain, utils

class MySpider(Spider):
    name = 'my_spider'
    start_urls = ['url1']
    # refresh_urls = True # Remove the leading '#' if you want to re-download links that have already been downloaded
    def __init__(self):
        # If your links are stored elsewhere, read them in here.
        self.start_urls = utils.getFileLines('you url file name.txt')
        Spider.__init__(self,self.name) # Necessary

    def extract(self, url, html, models, modelNames):
        doc = SimplifiedDoc(html)
        data = doc.select('pre>text()') # Extract the data you want.
        return {'Urls': None, 'Data':{'data':data} } # Return the data to the framework, which will save it for you.

SimplifiedMain.startThread(MySpider())  # Start download

You can see more examples, as well as the source code of the simplified_scrapy library, here: https://github.com/yiyedata/simplified-scrapy-demo
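
A note on the input file: judging from the call to utils.getFileLines, the spider presumably reads its start URLs from a plain text file with one URL per line (this is an assumption; the demo repository linked above shows the exact expected format). A hypothetical 'you url file name.txt' would then look something like:

https://example.com/page1
https://example.com/page2
https://example.com/page3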
