memoize to disk - python - persistent memoization
Is there a way to memoize the output of a function to disk?
I have a function
def getHtmlOfUrl(url):
    ... # expensive computation
and would like to do something like:
getHtmlMemoized = memoizeToFile(getHtmlOfUrl, "file.dat")
and then call getHtmlMemoized(url), so that the expensive computation is done only once for each url.
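Python offers a very elegant way to do this: decorators. Basically, a decorator is a function that wraps another function to provide additional functionality without changing the function's source code. Your decorator can be written like this:
import json

def persist_to_file(file_name):

    def decorator(original_func):

        # Load the existing cache, or start empty if the file is missing or invalid.
        try:
            with open(file_name, 'r') as f:
                cache = json.load(f)
        except (IOError, ValueError):
            cache = {}

        def new_func(param):
            if param not in cache:
                cache[param] = original_func(param)
                # Rewrite the cache file after every new result.
                with open(file_name, 'w') as f:
                    json.dump(cache, f)
            return cache[param]

        return new_func

    return decorator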
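Once you have that, "decorate" the function using the @-syntax and you are ready.
@persist_to_file('cache.dat')
def html_of_url(url):
    ...  # your function code
Note that this decorator is intentionally simplified and may not work for every situation, for example, when the source function accepts or returns data that cannot be JSON-serialized.
More on decorators: How to make a chain of function decorators?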
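One case where the JSON version falls short: JSON object keys are always strings, so an integer argument stored as a key comes back as a string after a reload and no longer matches. The following is only a sketch of one way around that, assuming pickle is acceptable for your data and that all arguments are hashable. The name persist_to_pickle and the demo function power are made up for illustration.

```python
import functools
import os
import pickle
import tempfile


def persist_to_pickle(file_name):
    """Hypothetical variant of persist_to_file that stores the cache with pickle."""

    def decorator(original_func):
        # Load an existing cache if there is one; otherwise start empty.
        try:
            with open(file_name, 'rb') as f:
                cache = pickle.load(f)
        except (IOError, EOFError, pickle.UnpicklingError):
            cache = {}

        @functools.wraps(original_func)
        def new_func(*args, **kwargs):
            # Build a hashable key from positional and keyword arguments.
            key = (args, tuple(sorted(kwargs.items())))
            if key not in cache:
                cache[key] = original_func(*args, **kwargs)
                with open(file_name, 'wb') as f:
                    pickle.dump(cache, f)
            return cache[key]

        return new_func

    return decorator


# Usage: the cache file lives in a temporary directory for this demo.
cache_path = os.path.join(tempfile.mkdtemp(), 'cache.pkl')
calls = []


@persist_to_pickle(cache_path)
def power(base, exponent=2):
    calls.append((base, exponent))  # record every real computation
    return base ** exponent


power(3)
power(3)              # served from the cache, no new computation
power(3, exponent=3)  # different arguments, computed once
```

Like the original, this rewrites the whole file on every new result, so it is only reasonable for small caches. Only load pickle files that you wrote yourself.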
Here is how to make the decorator save the cache just once, at exit time:
import json, atexit

def persist_to_file(file_name):

    try:
        cache = json.load(open(file_name, 'r'))
    except (IOError, ValueError):
        cache = {}

    atexit.register(lambda: json.dump(cache, open(file_name, 'w')))

    def decorator(func):
        def new_func(param):
            if param not in cache:
                cache[param] = func(param)
            return cache[param]
        return new_func

    return decorator
Check out joblib.Memory. It is a library for doing exactly that.
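A cleaner solution powered by Python's shelve module. The advantage is that the cache gets updated in real time via the well-known dict syntax, and it is also exception-proof (no need to handle the annoying KeyError).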
import shelve

def shelve_it(file_name):
    d = shelve.open(file_name)

    def decorator(func):
        def new_func(param):
            if param not in d:
                d[param] = func(param)
            return d[param]

        return new_func

    return decorator

@shelve_it('cache.shelve')
def expensive_function(param):
    pass
This makes the function compute its result only once. Subsequent calls return the stored result.
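Note that shelve keys have to be strings, so the decorator above only suits functions that take a single string argument. Below is a rough sketch of how it might be extended to other arguments, assuming their repr() is stable and unique (true for numbers, strings and tuples of those, not for arbitrary objects). The names shelve_args and square are hypothetical.

```python
import functools
import os
import shelve
import tempfile


def shelve_args(file_name):
    """Hypothetical variant of shelve_it that accepts any repr-able arguments."""

    def decorator(func):
        @functools.wraps(func)
        def new_func(*args, **kwargs):
            # shelve keys must be strings, so derive one from the arguments.
            key = repr((args, sorted(kwargs.items())))
            # Open and close the shelf on each call so data is flushed to disk.
            with shelve.open(file_name) as shelf:
                if key not in shelf:
                    shelf[key] = func(*args, **kwargs)
                return shelf[key]

        return new_func

    return decorator


# Usage: keep the shelf in a temporary directory for this demo.
shelf_path = os.path.join(tempfile.mkdtemp(), 'cache.shelve')
computed = []


@shelve_args(shelf_path)
def square(n):
    computed.append(n)  # record every real computation
    return n * n


square(4)
square(4)  # second call is answered from the shelf
```

Opening the shelf on every call is slower than keeping it open, but it avoids relying on the shelf being closed properly at interpreter exit.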
There is also diskcache.
from diskcache import Cache

cache = Cache("cachedir")

@cache.memoize()
def f(x, y):
    print('Running f({}, {})'.format(x, y))
    return x, y
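The Artemis library has a module for this. (You will need to pip install artemis-ml.)
You decorate your function:
from artemis.fileman.disk_memoize import memoize_to_disk

@memoize_to_disk
def fcn(a, b, c = None):
    results = ...
    return results
Internally, it makes a hash out of the input arguments and saves memo files by this hash.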
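To make the hash-per-call idea concrete, here is a minimal standard-library sketch of that general technique. It is not Artemis's actual implementation. The decorator name memoize_by_hash, the cache directory layout and the choice of SHA-256 over the pickled arguments are all assumptions made for illustration.

```python
import functools
import hashlib
import os
import pickle
import tempfile


def memoize_by_hash(cache_dir):
    """Hypothetical decorator: one pickle file per distinct argument hash."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Hash the function name together with the pickled arguments.
            payload = pickle.dumps((func.__name__, args, sorted(kwargs.items())))
            digest = hashlib.sha256(payload).hexdigest()
            path = os.path.join(cache_dir, digest + '.pkl')
            if os.path.exists(path):
                with open(path, 'rb') as f:
                    return pickle.load(f)
            # Cache miss: compute, then save the result under its hash.
            result = func(*args, **kwargs)
            os.makedirs(cache_dir, exist_ok=True)
            with open(path, 'wb') as f:
                pickle.dump(result, f)
            return result

        return wrapper

    return decorator


# Usage: memo files go into a temporary directory for this demo.
memo_dir = os.path.join(tempfile.mkdtemp(), 'memos')
runs = []


@memoize_by_hash(memo_dir)
def fcn(a, b, c=None):
    runs.append((a, b, c))  # record every real computation
    return [a, b, c]


fcn(1, 2)
fcn(1, 2)       # same hash, loaded from disk
fcn(1, 2, c=3)  # new hash, computed and saved
```

One caveat with this sketch: pickled bytes are not guaranteed to be identical for equal objects in every case (sets, or different Python versions, for example), so some calls may miss the cache even though their arguments compare equal.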
Something like this should do:
import json

class Memoize(object):
    def __init__(self, func):
        self.func = func
        self.memo = {}

    def load_memo(self, filename):
        with open(filename) as f:
            self.memo.update(json.load(f))

    def save_memo(self, filename):
        with open(filename, 'w') as f:
            json.dump(self.memo, f)

    def __call__(self, *args):
        # JSON object keys must be strings, so key the memo on the
        # JSON encoding of the argument tuple.
        key = json.dumps(args)
        if key not in self.memo:
            self.memo[key] = self.func(*args)
        return self.memo[key]
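Basic usage:
your_mem_func = Memoize(your_func)
your_mem_func.load_memo('yourdata.json')
# do your stuff with your_mem_func
If you want to write the "cache" to a file after using it, so that it can be loaded again in the future:
your_mem_func.save_memo('yournewdata.json')
Assuming that your data is JSON-serializable, this code should work: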
import os, json

def json_file(fname):
    def decorator(function):
        def wrapper(*args, **kwargs):
            if os.path.isfile(fname):
                with open(fname, 'r') as f:
                    ret = json.load(f)
            else:
                # Compute first, so a failing call does not leave an empty cache file behind.
                ret = function(*args, **kwargs)
                with open(fname, 'w') as f:
                    json.dump(ret, f)
            return ret
        return wrapper
    return decorator
Decorate getHtmlOfUrl and then simply call it; if it has been run before, you will get your cached data.
Checked with Python 2.x and Python 3.x.
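You can use the cache_to_disk package:
from cache_to_disk import cache_to_disk

@cache_to_disk(3)
def my_func(a, b, c, d=None):
    results = ...
    return results
This will cache the results for 3 days, specific to the arguments a, b, c and d. The results are stored in a pickle file on your machine, and are unpickled and returned the next time the function is called. After 3 days, the pickle file is deleted until the function is re-run. The function is re-run whenever it is called with new arguments. More info here: https://github.com/sarenehan/cache_to_disk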
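For readers who would rather not add a dependency, the expiry idea can be approximated with the standard library. This is only a sketch and not how cache_to_disk is implemented: it treats a cache file as stale once its modification time is older than the limit and then recomputes and overwrites it, rather than deleting it. The names cache_for, ttl_dir and the two-argument my_func are hypothetical.

```python
import functools
import hashlib
import os
import pickle
import tempfile
import time


def cache_for(seconds, cache_dir):
    """Hypothetical TTL cache: entries older than `seconds` are recomputed."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Derive a filesystem-safe file name from the call signature.
            key = repr((func.__name__, args, sorted(kwargs.items())))
            name = hashlib.sha256(key.encode('utf-8')).hexdigest() + '.pkl'
            path = os.path.join(cache_dir, name)
            # An entry is fresh while its modification time is recent enough.
            if os.path.exists(path) and time.time() - os.path.getmtime(path) < seconds:
                with open(path, 'rb') as f:
                    return pickle.load(f)
            result = func(*args, **kwargs)
            os.makedirs(cache_dir, exist_ok=True)
            with open(path, 'wb') as f:
                pickle.dump(result, f)
            return result

        return wrapper

    return decorator


ttl_dir = os.path.join(tempfile.mkdtemp(), 'ttl')
executions = []


@cache_for(3 * 24 * 3600, ttl_dir)  # three days, expressed in seconds
def my_func(a, b):
    executions.append((a, b))  # record every real computation
    return a + b


my_func(1, 2)
my_func(1, 2)  # still fresh, read from disk

# Simulate the passage of time by backdating the cache files by four days.
four_days_ago = time.time() - 4 * 24 * 3600
for file_name in os.listdir(ttl_dir):
    os.utime(os.path.join(ttl_dir, file_name), (four_days_ago, four_days_ago))

my_func(1, 2)  # stale entry, so the function runs again
```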
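Most of the answers take the decorator approach. But maybe I don't want to cache the result every time I call the function.
I made a solution using a context manager, so the function can be called as
with DiskCacher('cache_id', myfunc) as myfunc2:
    res=myfunc2(...)
when you need the caching functionality.
The 'cache_id' string is used to distinguish the data files, which are named [calling_script]_[cache_id].nc. So if you are doing this in a loop, you will need to incorporate the loop variable into this cache_id, otherwise the data will be overwritten.
Alternatively:
myfunc2=DiskCacher('cache_id')(myfunc)
res=myfunc2(...)
Alternatively (this is probably not very useful, as the same id is used all the time):
@DiskCacher('cache_id')
def myfunc(*args):
    ...
The complete code with examples (I am using pickle to save/load, but this can be changed to any save/read method. Note that this also assumes that the function in question returns only 1 return value):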
from __future__ import print_function
import sys, os
import functools

def formFilename(folder, varid):
    '''Compose abspath for cache file

    Args:
        folder (str): cache folder path.
        varid (str): variable id to form file name and used as variable id.
    Returns:
        abpath (str): abspath for cache file, which is using the <folder>
            as folder. The file name is the format:
                [script_file]_[varid].nc
    '''
    script_file=os.path.splitext(sys.argv[0])[0]
    name='[%s]_[%s].nc' %(script_file, varid)
    abpath=os.path.join(folder, name)

    return abpath

def readCache(folder, varid, verbose=True):
    '''Read cached data

    Args:
        folder (str): cache folder path.
        varid (str): variable id.
    Keyword Args:
        verbose (bool): whether to print some text info.
    Returns:
        results (tuple): a tuple containing data read in from cached file(s).
    '''
    import pickle
    abpath_in=formFilename(folder, varid)
    if os.path.exists(abpath_in):
        if verbose:
            print('\n# <readCache>: Read in variable', varid,
                    'from disk cache:\n', abpath_in)
        with open(abpath_in, 'rb') as fin:
            results=pickle.load(fin)

    return results

def writeCache(results, folder, varid, verbose=True):
    '''Write data to disk cache

    Args:
        results (tuple): a tuple containing data read to cache.
        folder (str): cache folder path.
        varid (str): variable id.
    Keyword Args:
        verbose (bool): whether to print some text info.
    '''
    import pickle
    abpath_out=formFilename(folder, varid)
    if verbose:
        print('\n# <writeCache>: Saving output to:\n',abpath_out)
    with open(abpath_out, 'wb') as fout:
        pickle.dump(results, fout)

    return

class DiskCacher(object):
    def __init__(self, varid, func=None, folder=None, overwrite=False,
            verbose=True):
        '''Disk cache context manager

        Args:
            varid (str): string id used to save cache.
                function <func> is assumed to return only 1 return value.
        Keyword Args:
            func (callable): function object whose return values are to be
                cached.
            folder (str or None): cache folder path. If None, use a default.
            overwrite (bool): whether to force a new computation or not.
            verbose (bool): whether to print some text info.
        '''
        if folder is None:
            self.folder='/tmp/cache/'
        else:
            self.folder=folder

        self.func=func
        self.varid=varid
        self.overwrite=overwrite
        self.verbose=verbose

    def __enter__(self):
        if self.func is None:
            raise Exception("Need to provide a callable function to __init__() when used as context manager.")

        return _Cache2Disk(self.func, self.varid, self.folder,
                self.overwrite, self.verbose)

    def __exit__(self, type, value, traceback):
        return

    def __call__(self, func=None):
        _func=func or self.func
        return _Cache2Disk(_func, self.varid, self.folder, self.overwrite,
                self.verbose)

def _Cache2Disk(func, varid, folder, overwrite, verbose):
    '''Inner decorator function

    Args:
        func (callable): function object whose return values are to be
            cached.
        varid (str): variable id.
        folder (str): cache folder path.
        overwrite (bool): whether to force a new computation or not.
        verbose (bool): whether to print some text info.
    Returns:
        decorated function: if cache exists, the function is <readCache>
            which will read cached data from disk. If needs to recompute,
            the function is wrapped that the return values are saved to disk
            before returning.
    '''
    def decorator_func(func):
        abpath_in=formFilename(folder, varid)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if os.path.exists(abpath_in) and not overwrite:
                results=readCache(folder, varid, verbose)
            else:
                results=func(*args, **kwargs)
                if not os.path.exists(folder):
                    os.makedirs(folder)
                writeCache(results, folder, varid, verbose)
            return results
        return wrapper

    return decorator_func(func)

if __name__=='__main__':

    data=range(10)  # dummy data

    #--------------Use as context manager--------------
    def func1(data, n):
        '''dummy function'''
        results=[i*n for i in data]
        return results

    print('\n### Context manager, 1st time call')
    with DiskCacher('context_mananger', func1) as func1b:
        res=func1b(data, 10)
        print('res =', res)

    print('\n### Context manager, 2nd time call')
    with DiskCacher('context_mananger', func1) as func1b:
        res=func1b(data, 10)
        print('res =', res)

    print('\n### Context manager, 3rd time call with overwrite=True')
    with DiskCacher('context_mananger', func1, overwrite=True) as func1b:
        res=func1b(data, 10)
        print('res =', res)

    #--------------Return a new function--------------
    def func2(data, n):
        results=[i*n for i in data]
        return results

    print('\n### Wrap a new function, 1st time call')
    func2b=DiskCacher('new_func')(func2)
    res=func2b(data, 10)
    print('res =', res)

    print('\n### Wrap a new function, 2nd time call')
    res=func2b(data, 10)
    print('res =', res)

    #----Decorate a function using the syntax sugar----
    @DiskCacher('pie_dec')
    def func3(data, n):
        results=[i*n for i in data]
        return results

    print('\n### pie decorator, 1st time call')
    res=func3(data, 10)
    print('res =', res)

    print('\n### pie decorator, 2nd time call.')
    res=func3(data, 10)
    print('res =', res)
Output:
### Context manager, 1st time call
# <writeCache>: Saving output to:
/tmp/cache/[diskcache]_[context_mananger].nc
res = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
### Context manager, 2nd time call
# <readCache>: Read in variable context_mananger from disk cache:
/tmp/cache/[diskcache]_[context_mananger].nc
res = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
### Context manager, 3rd time call with overwrite=True
# <writeCache>: Saving output to:
/tmp/cache/[diskcache]_[context_mananger].nc
res = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
### Wrap a new function, 1st time call
# <writeCache>: Saving output to:
/tmp/cache/[diskcache]_[new_func].nc
res = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
### Wrap a new function, 2nd time call
# <readCache>: Read in variable new_func from disk cache:
/tmp/cache/[diskcache]_[new_func].nc
res = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
### pie decorator, 1st time call
# <writeCache>: Saving output to:
/tmp/cache/[diskcache]_[pie_dec].nc
res = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
### pie decorator, 2nd time call.
# <readCache>: Read in variable pie_dec from disk cache:
/tmp/cache/[diskcache]_[pie_dec].nc
res = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
Check out Cachier. It supports additional cache configuration parameters such as TTL.
Simple example:
from cachier import cachier
import datetime

@cachier(stale_after=datetime.timedelta(days=3))
def foo(arg1, arg2):
    """foo now has a persistent cache, triggering recalculation for values stored more than 3 days."""
    return {'arg1': arg1, 'arg2': arg2}
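Here is a solution I came up with. It will create cache files such as:
cache.__main__.function.getApiCall.db
cache.myModule.function.fixDateFormat.db
cache.myOtherModule.function.getOtherApiCall.db
Here is the code. You can choose a compression library of your liking, but I have found that LZMA works best for the pickle storage we are using.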
import dbm
import hashlib
import json
import pickle
# import bz2
import lzma

# COMPRESSION = bz2
COMPRESSION = lzma  # better with pickle compression

# Create a @memoize_to_disk decorator to cache a memoize to disk cache
def memoize_to_disk(function, cache_filename=None):
    uniqueFunctionSignature = f'cache.{function.__module__}.{function.__class__.__name__}.{function.__name__}'
    if cache_filename is None:
        cache_filename = uniqueFunctionSignature
    # print(f'Caching to {cache_filename}')

    def wrapper(*args, **kwargs):
        # Convert the dictionary into a JSON object (can't memoize mutable fields, this gives us an immutable, hashable function signature)
        if cache_filename == uniqueFunctionSignature:
            # Cache file is function-specific, so don't include function name in params
            params = {'args': args, 'kwargs': kwargs}
        else:
            # add module.class.function name to params so no collisions occur if user overrides cache_filename with the same cache for multiple functions
            params = {'function': uniqueFunctionSignature, 'args': args, 'kwargs': kwargs}

        # key hash of the json representation of the function signature (to avoid unhashable dictionary errors)
        params_json = json.dumps(params)
        key = hashlib.sha256(params_json.encode("utf-8")).hexdigest()  # store hash of key

        # Get cache entry or create it if not found
        with dbm.open(cache_filename, 'c') as db:
            # Try to retrieve the result from the cache
            try:
                result = pickle.loads(COMPRESSION.decompress(db[key]))
                # print(f'CACHE HIT: Found {key[1:100]=} in {cache_filename=} with value {str(result)[0:100]=}')
                return result
            except KeyError:
                # If the result is not in the cache, call the function and store the result
                result = function(*args, **kwargs)
                db[key] = COMPRESSION.compress(pickle.dumps(result))
                # print(f'CACHE MISS: Stored {key[1:100]=} in {cache_filename=} with value {str(result)[0:100]=}')
                return result

    return wrapper
To use the code, apply the @memoize_to_disk decorator (with an optional filename parameter if you don't like "cache." as the prefix):
@memoize_to_disk
def expensive_example(n):
    # expensive operation goes here
    return value
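Whether LZMA or bz2 gives smaller cache entries depends on your data, so it may be worth measuring on a representative payload before picking one. The snippet below is just a sketch of such a check. The sample list is an invented, highly repetitive payload, and real results may compress very differently.

```python
import bz2
import lzma
import pickle

# Sample payload: repetitive structured data, an assumption for illustration.
sample = [{'id': i, 'name': 'item-%d' % i, 'tags': ['a', 'b', 'c']}
          for i in range(2000)]
raw = pickle.dumps(sample)

sizes = {'raw': len(raw)}
for module in (bz2, lzma):
    packed = module.compress(raw)
    # Every module must round-trip back to the original object.
    assert pickle.loads(module.decompress(packed)) == sample
    sizes[module.__name__] = len(packed)

# Print the size of the pickled data before and after compression.
for name, size in sizes.items():
    print('%-5s %8d bytes' % (name, size))
```

Both modules expose the same compress/decompress functions, which is why the COMPRESSION variable above can be switched between them without touching the rest of the code.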