
Using Flask-Cache to cache a lxml.html object

I'm building a Flask web application that has to fetch the full content of a non-local website, and I was wondering whether it is possible to cache that content to speed things up. The website does not change often, but I still want the cache to refresh about once a day.

Anyway, I looked it up and found Flask-Cache, which seemed to do what I wanted, so I made the appropriate changes to my application and ended up adding this:

from flask.ext.cache import Cache
[...]
cache = Cache()
[...]
cache.init_app(app)
[...]
@cache.cached(timeout=86400, key_prefix='content')
def get_content():
    return lxml.html.fromstring(urllib2.urlopen('http://WEBSITE.com').read())

and then the functions that need the content call it like so:

content = get_content()

Now I'd expect it to reuse the cached lxml.html object every time a call is made, but that's not what I'm seeing. The id of the object changes on every call and there's no speed-up at all. So have I misunderstood what Flask-Cache does, or am I doing something wrong here? I've tried using the memoize decorator instead, and I've tried decreasing the timeout or removing it altogether, but nothing seems to make any difference.

Thanks.

The default CACHE_TYPE is null, which gives you a NullCache, so you get no caching at all. That is exactly what you observe. The documentation does not make this explicit, but this line in the source of Cache.init_app does:

self.config.setdefault('CACHE_TYPE', 'null')

To actually get some caching, initialise your Cache instance with a real cache type:

cache = Cache(config={'CACHE_TYPE': 'simple'})

Aside: SimpleCache is fine for development, testing, and this example, but you shouldn't use it in production. Something like MemcachedCache or RedisCache would be a much better choice there.
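To see why the null default produces exactly the behaviour described in the question, here is a minimal sketch in plain Python 3. The NullCache and SimpleCache classes and the cached helper below are simplified, hypothetical stand-ins written for this illustration. They are not Flask-Cache's actual implementation, only an approximation of the idea.

```python
import time


class NullCache:
    """Stand-in for a null backend: stores nothing, so every lookup misses."""

    def get(self, key):
        return None

    def set(self, key, value, timeout):
        pass


class SimpleCache:
    """Stand-in for an in-memory backend with per-key expiry."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.time() >= expires_at:
            # Entry is too old: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value, timeout):
        self._store[key] = (time.time() + timeout, value)


def cached(cache, key, timeout):
    """Hypothetical decorator: consult the cache before calling the function."""
    def decorator(func):
        def wrapper():
            value = cache.get(key)
            if value is None:
                value = func()
                cache.set(key, value, timeout)
            return value
        return wrapper
    return decorator


def count_calls(cache):
    """Call a cached function twice and return how often its body really ran."""
    calls = []

    @cached(cache, key="content", timeout=86400)
    def get_content():
        calls.append(1)
        return "<html>...</html>"

    get_content()
    get_content()
    return len(calls)


null_calls = count_calls(NullCache())      # body runs on every call
simple_calls = count_calls(SimpleCache())  # body runs only on the first call
```

With the null backend the decorated function still runs on every call, which matches the missing speed-up in the question. Only a backend that actually stores values can skip the second call.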

Now, with an actual cache in place, you will run into the next problem. On the second call, the cached lxml.html object is retrieved from the cache, but it is broken, because these objects cannot be cached this way. The stack trace looks like this:

Traceback (most recent call last):
  File "/home/day/.virtualenvs/so-flask/lib/python2.7/site-packages/flask/app.py", line 1701, in __call__
    return self.wsgi_app(environ, start_response)
  File "/home/day/.virtualenvs/so-flask/lib/python2.7/site-packages/flask/app.py", line 1689, in wsgi_app
    response = self.make_response(self.handle_exception(e))
  File "/home/day/.virtualenvs/so-flask/lib/python2.7/site-packages/flask/app.py", line 1687, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/day/.virtualenvs/so-flask/lib/python2.7/site-packages/flask/app.py", line 1360, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/day/.virtualenvs/so-flask/lib/python2.7/site-packages/flask/app.py", line 1358, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/day/.virtualenvs/so-flask/lib/python2.7/site-packages/flask/app.py", line 1344, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/day/q12030403.py", line 20, in index
    return "get_content returned: {0!r}\n".format(get_content())
  File "lxml.etree.pyx", line 1034, in lxml.etree._Element.__repr__ (src/lxml/lxml.etree.c:41389)

  File "lxml.etree.pyx", line 881, in lxml.etree._Element.tag.__get__ (src/lxml/lxml.etree.c:39979)

  File "apihelpers.pxi", line 15, in lxml.etree._assertValidNode (src/lxml/lxml.etree.c:12306)

AssertionError: invalid Element proxy at 3056741852

So instead of caching the lxml.html object, cache the plain string (the content of the website that you downloaded) and reparse it to get a fresh lxml.html object every time. The cache still helps, because you no longer hit the other website on every request. Here is a full, working program that demonstrates the caching part of that solution:

from flask import Flask
from flask.ext.cache import Cache
import time
import lxml.html
import urllib2

app = Flask(__name__)

cache = Cache(config={'CACHE_TYPE': 'simple'})
cache.init_app(app)

@cache.cached(timeout=86400, key_prefix='content')
def get_content():
    app.logger.debug("get_content called")
#    return lxml.html.fromstring(urllib2.urlopen('http://daybarr.com/wishlist').read())
    return urllib2.urlopen('http://daybarr.com/wishlist').read()

@app.route("/")
def index():
    app.logger.debug("index called")
    return "get_content returned: {0!r}\n".format(get_content())

if __name__ == "__main__":
    app.run(debug=True)

When I run the program and make two requests to http://127.0.0.1:5000/, I get this output. Note that get_content is not called the second time, because the content is served from the cache.

 * Running on http://127.0.0.1:5000/
 * Restarting with reloader
--------------------------------------------------------------------------------
DEBUG in q12030403 [q12030403.py:20]:
index called
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
DEBUG in q12030403 [q12030403.py:14]:
get_content called
--------------------------------------------------------------------------------
127.0.0.1 - - [21/Dec/2012 00:03:28] "GET / HTTP/1.1" 200 -
--------------------------------------------------------------------------------
DEBUG in q12030403 [q12030403.py:20]:
index called
--------------------------------------------------------------------------------
127.0.0.1 - - [21/Dec/2012 00:03:33] "GET / HTTP/1.1" 200 -
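
The program above only returns the cached string. The remaining step, reparsing that string on every call, could be sketched roughly as follows. This is an illustration under several assumptions, written for Python 3 with only the standard library so that it runs without network access: fetch_page is a hypothetical stand-in for the urllib2.urlopen(...).read() call, functools.lru_cache stands in for the Flask-Cache decorator (and, unlike it, has no timeout), and xml.etree.ElementTree stands in for lxml.html.

```python
import functools
import xml.etree.ElementTree as ET

fetch_count = 0


def fetch_page():
    """Hypothetical stand-in for the slow download of the remote page."""
    global fetch_count
    fetch_count += 1
    return "<html><body><p>wishlist</p></body></html>"


@functools.lru_cache(maxsize=1)
def get_content():
    """Cache only the raw string, which is safe to store."""
    return fetch_page()


def get_tree():
    """Build a fresh element tree from the cached string on every call."""
    return ET.fromstring(get_content())


first = get_tree()
second = get_tree()
```

Here first and second are distinct tree objects, yet fetch_page ran only once. In the real application, get_tree would presumably call lxml.html.fromstring(get_content()) instead, with get_content decorated by cache.cached as in the program above.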
