
How to implement caching for results after web scraping with mechanize using Python

My web scraping script is written in Python and uses mechanize. This is what it looks like (with sensitive information replaced):

import mechanize
import cookielib
from bs4 import BeautifulSoup
import html2text
import json

# Set up the browser with a cookie jar and the usual handlers
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_debug_responses(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Safari/8.0')]

# Log in
br.open('https://example.com/login.jsp')
for f in br.forms():
    print f
br.select_form(nr=0)
br.form['u'] = 'abcd'
br.form['p'] = '1234'
br.submit()

def get_information():
    locations = []
    data = json.load(br.open('https://example.com/iWantThisJson.jsp'))
    for entry in data["stores"]:
        # Keep only the part of the store name before the first "("
        location = entry["name"].split("(", 1)[0]
        locations.append(location)
    return locations

After logging in, my get_information() method retrieves a list of store locations and, after slicing each name down to the part I want, returns them in the list locations. This method is called in my website, which is built with Flask and currently running on localhost. Here is where it is called in my website code:

class reportDownload(Form):
    locations = get_information()    # the scrape happens here
    locations_names = list(enumerate(locations))
    location = SelectField(u'Location', choices=locations_names)

This list is then displayed in a dropdown menu on my website for the user to select an option from.

My question is: how do I implement caching for the results returned by my get_information() method? I don't want to scrape the site every time the page that uses this information is accessed, which happens quite frequently since it is one of the main pages. I have tried searching for how to implement caching, but as I'm still pretty new to this, I'm having trouble understanding what needs to be done. I will be grateful if anyone can point me to a relevant example!

Thank you! :)

There are plenty of easy ways to implement caching.

  1. Write the data to a file. This is especially easy if you are just dealing with small amounts of data; a JSON blob can be written straight to a file, as shown below and in the fuller sketch after this list. (Opening the file in "w" mode overwrites the previous cache entry rather than piling up stale blobs.)

    with open("my_cache_file", "w") as file_:
        file_.write(my_json_blob)

  2. Use a key-value store such as redis to cache the data. Installing redis is easy, and using it is even easier; there is a very well written Python client for redis.

    import redis

    r = redis.StrictRedis()    # connects to localhost:6379 by default
    r.set(key, json_blob)      # cache the blob under a key
    json_blob = r.get(key)     # read it back (returns bytes, or None on a miss)

  3. You can also use a commonly available database like Postgres, MongoDB, or MariaDB to store your data on disk, but this is mainly worthwhile if your data is so large that it vastly outgrows the amount of memory on your machine (redis keeps its data in memory).
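To make the file-based option concrete, here is a minimal sketch of a time-based cache wrapped around the asker's get_information(); the file name and the one-hour TTL are assumptions for illustration, not anything prescribed in the question:

import json
import os
import time

CACHE_FILE = "locations_cache.json"   # hypothetical cache path
CACHE_TTL = 3600                      # assumed expiry: one hour, in seconds

def get_information_cached():
    # Serve from the cache file if it exists and is fresh enough
    if os.path.exists(CACHE_FILE):
        age = time.time() - os.path.getmtime(CACHE_FILE)
        if age < CACHE_TTL:
            with open(CACHE_FILE) as f:
                return json.load(f)
    # Cache missing or stale: scrape again and rewrite the file
    locations = get_information()    # the scraping function from the question
    with open(CACHE_FILE, "w") as f:
        json.dump(locations, f)
    return locations

The Flask form would then call get_information_cached() instead of get_information(), so the scrape only runs when the cached copy is older than the TTL.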

In case anyone else visits this thread: another good option for caching when scraping (if you're using requests) is the requests-cache module.

It's a plug-in for requests and, after a few lines of configuration, will handle caching for you.

import requests
import requests_cache

requests_cache.install_cache('name/of/cache',
                             backend='mongodb',
                             expire_after=3600)

# use requests as usual

As shown in the above example, the module allows us to easily define a cache name, backend, and expiration time.
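Because requests-cache monkey-patches requests, you can verify that it is working through the from_cache attribute it adds to each response. A minimal sketch, assuming the default sqlite backend and a made-up cache name:

import requests
import requests_cache

requests_cache.install_cache('demo_cache', expire_after=3600)  # sqlite backend by default

r1 = requests.get('https://example.com/iWantThisJson.jsp')
print(r1.from_cache)   # False: the first request went over the network

r2 = requests.get('https://example.com/iWantThisJson.jsp')
print(r2.from_cache)   # True: served from the cache, no scrape performed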
