
memoize to disk - python - persistent memoization

Is there a way to memoize the output of a function to disk?

I have a function

def getHtmlOfUrl(url):
    ... # expensive computation

and would like to do something like:

getHtmlMemoized = memoizeToFile(getHtmlOfUrl, "file.dat")

and then call getHtmlMemoized(url), so that the expensive computation is done only once for each url.

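Python offers a very elegant way to do this: decorators. Basically, a decorator is a function that wraps another function to provide additional functionality without changing the function's source code. Your decorator can be written like this: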

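import functools
import json

def persist_to_file(file_name):

    def decorator(original_func):

        # Load the cache once, when the function is decorated.
        try:
            with open(file_name, 'r') as f:
                cache = json.load(f)
        except (IOError, ValueError):
            cache = {}

        @functools.wraps(original_func)
        def new_func(param):
            # JSON object keys are strings, so param should be a string (e.g. a URL).
            if param not in cache:
                cache[param] = original_func(param)
                # Rewrite the whole cache file after every new result.
                with open(file_name, 'w') as f:
                    json.dump(cache, f)
            return cache[param]

        return new_func

    return decorator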

Once you have that, "decorate" the function using the @-syntax and you're ready to go.

@persist_to_file('cache.dat')
def html_of_url(url):
    ...  # your function code

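Note that this decorator is intentionally simplified and may not work for every situation, for example when the source function accepts or returns data that cannot be JSON-serialized.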
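One way around the JSON limitation is to pickle the cache instead. The following is only a sketch under that assumption, not part of the original answer: `persist_with_pickle` is a hypothetical name, the key is built from both positional and keyword arguments, and all arguments are assumed to be hashable.

```python
import functools
import os
import pickle


def persist_with_pickle(file_name):
    """Hypothetical variant of persist_to_file that pickles the cache."""

    def decorator(func):
        # Load an existing cache, or start empty if the file is missing.
        # Only unpickle files you created yourself: pickle can run code.
        if os.path.exists(file_name):
            with open(file_name, "rb") as f:
                cache = pickle.load(f)
        else:
            cache = {}

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # A tuple key works as long as every argument is hashable.
            key = (args, tuple(sorted(kwargs.items())))
            if key not in cache:
                cache[key] = func(*args, **kwargs)
                # Rewrite the cache file after every new result.
                with open(file_name, "wb") as f:
                    pickle.dump(cache, f)
            return cache[key]

        return wrapper

    return decorator
```

Return values can then be any picklable object, at the cost of a cache file that is no longer human-readable.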

More on decorators: How to make a chain of function decorators?

Here is how to make the decorator save the cache only once, at exit:

import atexit
import json

def persist_to_file(file_name):

    try:
        with open(file_name, 'r') as f:
            cache = json.load(f)
    except (IOError, ValueError):
        cache = {}

    def save_cache():
        with open(file_name, 'w') as f:
            json.dump(cache, f)

    # Write the cache to disk once, when the interpreter exits normally.
    atexit.register(save_cache)

    def decorator(func):
        def new_func(param):
            if param not in cache:
                cache[param] = func(param)
            return cache[param]
        return new_func

    return decorator

Check out joblib.Memory. It's a library that does exactly this.

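A cleaner solution powered by Python's shelve module. The advantage is that the cache is updated in real time through the well-known dict syntax, and it is also exception-proof (no need to handle the annoying KeyError).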

import shelve
def shelve_it(file_name):
    d = shelve.open(file_name)

    def decorator(func):
        def new_func(param):
            if param not in d:
                d[param] = func(param)
            return d[param]

        return new_func

    return decorator

@shelve_it('cache.shelve')
def expensive_function(param):
    pass

This makes the function compute each result only once. Subsequent calls return the stored result.
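Keep in mind that shelve keys must be strings, so the decorator above only works when param is a string. A possible variation, sketched here with the hypothetical name `shelve_args`, turns all arguments into a string key with repr (assuming the arguments have a stable repr) and opens the shelf per call so that it is always closed properly:

```python
import functools
import shelve


def shelve_args(file_name):
    """Hypothetical shelve-backed memoizer for any number of arguments."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # shelve only accepts string keys, so build one from the arguments.
            key = repr((args, sorted(kwargs.items())))
            # The context manager closes (and flushes) the shelf on exit.
            with shelve.open(file_name) as db:
                if key not in db:
                    db[key] = func(*args, **kwargs)
                return db[key]

        return wrapper

    return decorator
```

Opening the shelf on every call is slower than keeping it open, but it avoids leaving the file open for the lifetime of the program.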

There is also diskcache:

from diskcache import Cache

cache = Cache("cachedir")

@cache.memoize()
def f(x, y):
    print('Running f({}, {})'.format(x, y))
    return x, y

The Artemis library has a module for this. (You'll need to pip install artemis-ml.)

Then you decorate your function:

from artemis.fileman.disk_memoize import memoize_to_disk

@memoize_to_disk
def fcn(a, b, c = None):
    results = ...
    return results

Internally, it makes a hash out of the input arguments and saves the memo files under this hash.
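The general idea of hashing the arguments into a file name can be sketched with the standard library alone. This is not Artemis's actual implementation, only an illustration; `memoize_by_hash` and the `cache_dir` parameter are made-up names, and the sketch assumes that equal arguments pickle to the same bytes (true for simple values such as numbers, strings and tuples, but not guaranteed for every object).

```python
import functools
import hashlib
import os
import pickle


def memoize_by_hash(cache_dir):
    """Hypothetical decorator: one pickle file per distinct argument hash."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            os.makedirs(cache_dir, exist_ok=True)
            # Hash the function name plus its arguments into a file name.
            payload = pickle.dumps((func.__name__, args, sorted(kwargs.items())))
            digest = hashlib.sha256(payload).hexdigest()
            path = os.path.join(cache_dir, digest + ".pkl")
            if os.path.exists(path):
                with open(path, "rb") as f:
                    return pickle.load(f)
            result = func(*args, **kwargs)
            with open(path, "wb") as f:
                pickle.dump(result, f)
            return result

        return wrapper

    return decorator
```

Because every result lives in its own file, a single entry can be invalidated by deleting that file.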

Something like this should do:

import json

class Memoize(object):
    def __init__(self, func):
        self.func = func
        self.memo = {}

    def load_memo(self, filename):
        with open(filename) as f:
            self.memo.update(json.load(f))

    def save_memo(self, filename):
        with open(filename, 'w') as f:
            json.dump(self.memo, f)

    def __call__(self, *args):
        # JSON object keys must be strings, so serialize the argument tuple.
        key = json.dumps(args)
        if key not in self.memo:
            self.memo[key] = self.func(*args)
        return self.memo[key]

Basic usage:

your_mem_func = Memoize(your_func)
your_mem_func.load_memo('yourdata.json')
#  do your stuff with your_mem_func

If you want to write the "cache" to a file after using it, so it can be loaded again in the future:

your_mem_func.save_memo('yournewdata.json')

Assuming that your data is JSON-serializable, this code should work:

import os, json

def json_file(fname):
    def decorator(function):
        def wrapper(*args, **kwargs):
            # Note: the cached value is keyed by file name only, not by the arguments.
            if os.path.isfile(fname):
                with open(fname, 'r') as f:
                    ret = json.load(f)
            else:
                # Compute first so a failing call does not leave an empty cache file.
                ret = function(*args, **kwargs)
                with open(fname, 'w') as f:
                    json.dump(ret, f)
            return ret
        return wrapper
    return decorator

Decorate getHtmlOfUrl and then simply call it; if it has been run before, you will get your cached data.

Checked with Python 2.x and Python 3.x.

You can use the cache_to_disk package:

    from cache_to_disk import cache_to_disk

    @cache_to_disk(3)
    def my_func(a, b, c, d=None):
        results = ...
        return results

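This will cache the results for 3 days, specific to the arguments a, b, c and d. The results are stored in a pickle file on your machine, then unpickled and returned the next time the function is called. After 3 days, the pickle file is deleted until the function is re-run. The function is re-run whenever it is called with new arguments. More info here: https://github.com/sarenehan/cache_to_disk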
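For readers curious how such an expiry could work, here is a rough standard-library sketch. It is not how cache_to_disk is implemented; `cache_for_days` and `cache_dir` are hypothetical names, the age of an entry is taken from the file's modification time, and the arguments are assumed to have a stable repr.

```python
import functools
import hashlib
import os
import pickle
import time


def cache_for_days(days, cache_dir):
    """Hypothetical decorator: pickle results and expire them after `days`."""
    max_age = days * 24 * 60 * 60  # lifetime of an entry, in seconds

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            os.makedirs(cache_dir, exist_ok=True)
            # One file per distinct set of arguments.
            key = repr((func.__name__, args, sorted(kwargs.items())))
            name = hashlib.sha256(key.encode("utf-8")).hexdigest() + ".pkl"
            path = os.path.join(cache_dir, name)
            if os.path.exists(path):
                age = time.time() - os.path.getmtime(path)
                if age < max_age:
                    with open(path, "rb") as f:
                        return pickle.load(f)
                os.remove(path)  # stale entry: drop it and recompute
            result = func(*args, **kwargs)
            with open(path, "wb") as f:
                pickle.dump(result, f)
            return result

        return wrapper

    return decorator
```

In this sketch a stale file is only removed when the same call is made again, so old entries for calls that never recur would need a separate cleanup step.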

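Most answers take the decorator approach. But maybe I don't want to cache the result every time the function is called.

I made a solution using a context manager, so the function can be called as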

with DiskCacher('cache_id', myfunc) as myfunc2:
    res=myfunc2(...)

when you need the caching functionality.

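The 'cache_id' string is used to distinguish the data files, which are named [calling_script]_[cache_id].nc. So if you are doing this in a loop, you will need to incorporate the loop variable into this cache_id, otherwise the data will be overwritten.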

Alternatively:

myfunc2=DiskCacher('cache_id')(myfunc)
res=myfunc2(...)

Or (this is probably not very useful, as the same id is used all the time):

@DiskCacher('cache_id')
def myfunc(*args):
    ...

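The complete code with examples (I'm using pickle to save/load, but this can be changed to any save/read method. Note that this also assumes the function in question returns only 1 return value):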

from __future__ import print_function
import sys, os
import functools

def formFilename(folder, varid):
    '''Compose abspath for cache file

    Args:
        folder (str): cache folder path.
        varid (str): variable id to form file name and used as variable id.
    Returns:
        abpath (str): abspath for cache file, which is using the <folder>
            as folder. The file name is the format:
                [script_file]_[varid].nc
    '''
    script_file=os.path.splitext(sys.argv[0])[0]
    name='[%s]_[%s].nc' %(script_file, varid)
    abpath=os.path.join(folder, name)

    return abpath


def readCache(folder, varid, verbose=True):
    '''Read cached data

    Args:
        folder (str): cache folder path.
        varid (str): variable id.
    Keyword Args:
        verbose (bool): whether to print some text info.
    Returns:
        results (tuple): a tuple containing data read in from cached file(s).
    '''
    import pickle
    abpath_in=formFilename(folder, varid)
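    # Fail clearly instead of hitting an unbound `results` below.
    if not os.path.exists(abpath_in):
        raise IOError('Cache file not found: %s' % abpath_in)
    if verbose:
        print('\n# <readCache>: Read in variable', varid,
                'from disk cache:\n', abpath_in)
    with open(abpath_in, 'rb') as fin:
        results=pickle.load(fin)

    return results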


def writeCache(results, folder, varid, verbose=True):
    '''Write data to disk cache

    Args:
        results (tuple): a tuple containing data read to cache.
        folder (str): cache folder path.
        varid (str): variable id.
    Keyword Args:
        verbose (bool): whether to print some text info.
    '''
    import pickle
    abpath_out=formFilename(folder, varid)
    if verbose:
        print('\n# <writeCache>: Saving output to:\n',abpath_out)
    with open(abpath_out, 'wb') as fout:
        pickle.dump(results, fout)

    return


class DiskCacher(object):
    def __init__(self, varid, func=None, folder=None, overwrite=False,
            verbose=True):
        '''Disk cache context manager

        Args:
            varid (str): string id used to save cache.
                function <func> is assumed to return only 1 return value.
        Keyword Args:
            func (callable): function object whose return values are to be
                cached.
            folder (str or None): cache folder path. If None, use a default.
            overwrite (bool): whether to force a new computation or not.
            verbose (bool): whether to print some text info.
        '''

        if folder is None:
            self.folder='/tmp/cache/'
        else:
            self.folder=folder

        self.func=func
        self.varid=varid
        self.overwrite=overwrite
        self.verbose=verbose

    def __enter__(self):
        if self.func is None:
            raise Exception("Need to provide a callable function to __init__() when used as context manager.")

        return _Cache2Disk(self.func, self.varid, self.folder,
                self.overwrite, self.verbose)

    def __exit__(self, type, value, traceback):
        return

    def __call__(self, func=None):
        _func=func or self.func
        return _Cache2Disk(_func, self.varid, self.folder, self.overwrite,
                self.verbose)



def _Cache2Disk(func, varid, folder, overwrite, verbose):
    '''Inner decorator function

    Args:
        func (callable): function object whose return values are to be
            cached.
        varid (str): variable id.
        folder (str): cache folder path.
        overwrite (bool): whether to force a new computation or not.
        verbose (bool): whether to print some text info.
    Returns:
        decorated function: if cache exists, the function is <readCache>
            which will read cached data from disk. If needs to recompute,
            the function is wrapped that the return values are saved to disk
            before returning.
    '''

    def decorator_func(func):
        abpath_in=formFilename(folder, varid)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if os.path.exists(abpath_in) and not overwrite:
                results=readCache(folder, varid, verbose)
            else:
                results=func(*args, **kwargs)
                if not os.path.exists(folder):
                    os.makedirs(folder)
                writeCache(results, folder, varid, verbose)
            return results
        return wrapper

    return decorator_func(func)



if __name__=='__main__':

    data=range(10)  # dummy data

    #--------------Use as context manager--------------
    def func1(data, n):
        '''dummy function'''
        results=[i*n for i in data]
        return results

    print('\n### Context manager, 1st time call')
    with DiskCacher('context_mananger', func1) as func1b:
        res=func1b(data, 10)
        print('res =', res)

    print('\n### Context manager, 2nd time call')
    with DiskCacher('context_mananger', func1) as func1b:
        res=func1b(data, 10)
        print('res =', res)

    print('\n### Context manager, 3rd time call with overwrite=True')
    with DiskCacher('context_mananger', func1, overwrite=True) as func1b:
        res=func1b(data, 10)
        print('res =', res)

    #--------------Return a new function--------------
    def func2(data, n):
        results=[i*n for i in data]
        return results

    print('\n### Wrap a new function, 1st time call')
    func2b=DiskCacher('new_func')(func2)
    res=func2b(data, 10)
    print('res =', res)

    print('\n### Wrap a new function, 2nd time call')
    res=func2b(data, 10)
    print('res =', res)

    #----Decorate a function using the syntax sugar----
    @DiskCacher('pie_dec')
    def func3(data, n):
        results=[i*n for i in data]
        return results

    print('\n### pie decorator, 1st time call')
    res=func3(data, 10)
    print('res =', res)

    print('\n### pie decorator, 2nd time call.')
    res=func3(data, 10)
    print('res =', res)

Output:

### Context manager, 1st time call

# <writeCache>: Saving output to:
 /tmp/cache/[diskcache]_[context_mananger].nc
res = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

### Context manager, 2nd time call

# <readCache>: Read in variable context_mananger from disk cache:
 /tmp/cache/[diskcache]_[context_mananger].nc
res = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

### Context manager, 3rd time call with overwrite=True

# <writeCache>: Saving output to:
 /tmp/cache/[diskcache]_[context_mananger].nc
res = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

### Wrap a new function, 1st time call

# <writeCache>: Saving output to:
 /tmp/cache/[diskcache]_[new_func].nc
res = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

### Wrap a new function, 2nd time call

# <readCache>: Read in variable new_func from disk cache:
 /tmp/cache/[diskcache]_[new_func].nc
res = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

### pie decorator, 1st time call

# <writeCache>: Saving output to:
 /tmp/cache/[diskcache]_[pie_dec].nc
res = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

### pie decorator, 2nd time call.

# <readCache>: Read in variable pie_dec from disk cache:
 /tmp/cache/[diskcache]_[pie_dec].nc
res = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

Check out Cachier. It supports additional cache configuration parameters such as TTL.

A simple example:

from cachier import cachier
import datetime

@cachier(stale_after=datetime.timedelta(days=3))
def foo(arg1, arg2):
  """foo now has a persistent cache, triggering recalculation for values stored more than 3 days."""
  return {'arg1': arg1, 'arg2': arg2}

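Here is a solution I came up with which can:

  • memoize mutable objects (the memoized functions should have no side effects that change their mutable parameters, or it won't work as expected)
  • write a separate cache file for each wrapped function (so it is easy to delete the file to purge that particular cache)
  • compress the data to make it smaller on disk (a LOT smaller)

It will create cache files like: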

cache.__main__.function.getApiCall.db
cache.myModule.function.fixDateFormat.db
cache.myOtherModule.function.getOtherApiCall.db

Here is the code. You can choose your preferred compression library, but I found LZMA works best for the pickle storage we are using.

import dbm
import hashlib
import json
import pickle
# import bz2
import lzma

# COMPRESSION = bz2
COMPRESSION = lzma # better with pickle compression

# Create a @memoize_to_disk decorator to cache a memoize to disk cache
def memoize_to_disk(function, cache_filename=None):
    uniqueFunctionSignature = f'cache.{function.__module__}.{function.__class__.__name__}.{function.__name__}'
    if cache_filename is None:
        cache_filename = uniqueFunctionSignature
        # print(f'Caching to {cache_file}')
    def wrapper(*args, **kwargs):
        # Convert the dictionary into a JSON object (can't memoize mutable fields, this gives us an immutable, hashable function signature)
        if cache_filename == uniqueFunctionSignature:
            # Cache file is function-specific, so don't include function name in params
            params = {'args': args, 'kwargs': kwargs} 
        else:
            # add module.class.function name to params so no collisions occur if user overrides cache_file with the same cache for multiple functions
            params = {'function': uniqueFunctionSignature, 'args': args, 'kwargs': kwargs}

        # key hash of the json representation of the function signature (to avoid immutable dictionary errors)
        params_json = json.dumps(params)  
        key = hashlib.sha256(params_json.encode("utf-8")).hexdigest()  # store hash of key
        # Get cache entry or create it if not found
        with dbm.open(cache_filename, 'c') as db:
            # Try to retrieve the result from the cache
            try:
                result = pickle.loads(COMPRESSION.decompress(db[key]))
                # print(f'CACHE HIT: Found {key[1:100]=} in {cache_file=} with value {str(result)[0:100]=}')
                return result
            except KeyError:
                # If the result is not in the cache, call the function and store the result
                result = function(*args, **kwargs)
                db[key] = COMPRESSION.compress(pickle.dumps(result))
                # print(f'CACHE MISS: Stored {key[1:100]=} in {cache_file=} with value {str(result)[0:100]=}')
                return result
    return wrapper

To use the code, apply the @memoize_to_disk decorator (with an optional filename argument if you don't like "cache." as a prefix):

@memoize_to_disk
def expensive_example(n):
  value = ...  # expensive operation goes here
  return value
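If you want to check the compression choice on your own data, a quick comparison along these lines may help. This is only a sketch: `compressed_sizes` is a hypothetical helper and the sample data is invented, so the sizes it reports say nothing about any particular real workload.

```python
import bz2
import lzma
import pickle


def compressed_sizes(obj):
    """Return the pickled size of obj, raw and under each compressor."""
    raw = pickle.dumps(obj)
    sizes = {"raw": len(raw)}
    for module in (bz2, lzma):
        packed = module.compress(raw)
        # Round-trip check: decompressing and unpickling must give obj back.
        assert pickle.loads(module.decompress(packed)) == obj
        sizes[module.__name__] = len(packed)
    return sizes


# Hypothetical sample data, standing in for a real cached result.
sample = [{"id": i, "name": "item-%d" % i} for i in range(1000)]
print(compressed_sizes(sample))
```

Both modules expose the same compress/decompress pair, which is why the COMPRESSION variable above can be switched between them without other changes.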

