
Persistent in-memory Python object for nginx/uwsgi server

I doubt this is even possible, but here is the problem and proposed solution (the feasibility of the proposed solution is the object of this question):


I have some "global data" that needs to be available for all requests. I'm persisting this data to Riak and using Redis as a caching layer for access speed (for now...). The data is split into about 30 logical chunks, each about 8 KB.

Each request is required to read 4 of these 8 KB chunks, resulting in 32 KB of data read in from Redis or Riak. This is in ADDITION to any request-specific data which would also need to be read (which is quite a bit).

Assuming even 3000 requests per second (this isn't a live server so I don't have real numbers, but 3000 rps is a reasonable assumption, and it could be more), this means roughly 96 MB/s of transfer from Redis or Riak (3000 × 32 KB), in ADDITION to the already not-insignificant other calls being made from the application logic. Also, Python is parsing the JSON of these 8 KB objects about 12,000 times every second (4 chunks per request at 3000 requests per second).


All of this, especially Python having to repeatedly deserialize the data, seems like an utter waste, and a perfectly elegant solution would be to just have the deserialized data cached in an in-memory native object in Python, which I could refresh periodically, as and when all this "static" data becomes stale. Once every few minutes (or hours), instead of 3000 times per second.

But I don't know if this is even possible. You'd realistically need an "always running" application for it to cache any data in its memory. And I know this is not the case in the nginx+uwsgi+python combination (versus something like node) - python in-memory data will NOT be persisted across all requests to my knowledge, unless I'm terribly mistaken.

Unfortunately this is a system I have "inherited" and therefore can't make too many changes in terms of the base technology, nor am I knowledgeable enough about how the nginx+uwsgi+python combination works in terms of starting up Python processes and persisting Python in-memory data - which means I COULD be terribly mistaken with my assumption above!


So, direct advice on whether this solution would work, plus references to material that could help me understand how the nginx+uwsgi+python combination works in terms of starting new processes and memory allocation, would help greatly.

PS:

  1. Have gone through some of the documentation for nginx, uwsgi, etc. but haven't fully understood the ramifications for my use-case yet. Hope to make some progress on that going forward now.

  2. If the in-memory thing COULD work out, I would chuck Redis, since I'm caching ONLY the static data I mentioned above in it. This makes an in-process persistent in-memory Python cache even more attractive to me, removing one moving part from the system and at least FOUR network round-trips per request.

What you're suggesting isn't directly feasible. Since new processes can be spun up and down outside of your control, there's no way to keep native Python data in memory.

However, there are a few ways around this.

Often, one level of key-value storage is all you need. And sometimes, having fixed-size buffers for values (which you can use directly as str / bytes / bytearray objects; anything else you'd need to pack in there with struct or otherwise serialize) is all you need. In that case, uWSGI's built-in caching framework will take care of everything you need.
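
To make that concrete, here is a minimal sketch of how that could look; the cache name, the TTL, and the fetch_chunk_from_riak loader are all assumptions, and the cache itself has to be declared in the uWSGI config (e.g. cache2 = name=chunks,items=64,blocksize=8192):

    import json

    import uwsgi  # this module only exists inside a running uWSGI worker

    CHUNK_TTL = 300  # seconds before a cached chunk expires (assumed refresh window)

    def get_chunk(chunk_id):
        # cache_get returns the raw stored bytes, or None on a miss
        raw = uwsgi.cache_get(chunk_id, 'chunks')
        if raw is None:
            raw = fetch_chunk_from_riak(chunk_id)  # hypothetical loader returning a JSON string
            uwsgi.cache_set(chunk_id, raw, CHUNK_TTL, 'chunks')
        return json.loads(raw)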

If you need more precise control, you can look at how the cache is implemented on top of SharedArea and do something custom. However, I wouldn't recommend that. It basically gives you the same kind of API you get with a file, and the only real advantages over just using a file are that the server will manage the file's lifetime; it works in all uWSGI-supported languages, even those that don't allow files; and it makes it easier to migrate your custom cache to a distributed (multi-computer) cache if you later need to. I don't think any of those are relevant to you.

Another way to get flat key-value storage, but without the fixed-size buffers, is with Python's stdlib anydbm. The key-value lookup is as pythonic as it gets: it looks just like a dict, except that it's backed up to an on-disk BDB (or similar) database, cached as appropriate in memory, instead of being stored in an in-memory hash table.
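
A rough sketch of the idiom, assuming Python 2 (where the module is named anydbm; Python 3 renamed it dbm) and an invented file path:

    import anydbm

    # 'c' creates the database file if it doesn't exist; keys and values must be strings
    db = anydbm.open('/var/cache/myapp/chunks.db', 'c')
    db['chunk-42'] = '{"some": "json"}'   # writes go straight to the on-disk database
    raw = db['chunk-42']                  # reads look just like dict access
    db.close()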

If you need to handle a few other simple types (anything that's blazingly fast to un/pickle, like ints), you may want to consider shelve.
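
A shelf has the same dict-like interface but pickles the values for you; a minimal sketch, again with an invented file path:

    import shelve

    s = shelve.open('/var/cache/myapp/meta.shelf')
    s['request_count'] = 12345           # small ints and the like un/pickle very quickly
    s['chunk_meta'] = {'version': 3}     # stored pickled, returned as a real dict
    meta = s['chunk_meta']
    s.close()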

If your structure is rigid enough, you can use a key-value database for the top level, but access the values through a ctypes.Structure, or de/serialize them with struct. But usually, if you can do that, you can also eliminate the top level, at which point your whole thing is just one big Structure or Array.

At that point, you can just use a plain file for storage: either mmap it (for ctypes), or just open and read it (for struct).
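
For example, if each of the 8 KB chunks had a fixed layout, a sketch along these lines would map the whole file once per worker; the field layout and file path are invented for illustration, and the file must already be at least 30 × 8192 bytes:

    import ctypes
    import mmap

    class Chunk(ctypes.Structure):
        _fields_ = [
            ('version', ctypes.c_uint32),
            ('payload', ctypes.c_char * 8188),  # pads the struct out to exactly 8 KB
        ]

    with open('/var/cache/myapp/chunks.bin', 'r+b') as f:
        mm = mmap.mmap(f.fileno(), 0)  # map the whole file; stays valid after close

    chunks = (Chunk * 30).from_buffer(mm)  # the ~30 logical chunks from the question
    version = chunks[4].version            # reads come straight from the page cache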

Or use multiprocessing's Shared ctypes Objects to access your Structure directly out of a shared memory area.
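
A sketch of that, under the assumption that the shared block is created at module import time, before uWSGI forks its workers (i.e. without lazy-apps), so every worker sees the same memory:

    import ctypes
    from multiprocessing import sharedctypes

    # allocated in the master at import time; forked workers all inherit the same
    # shared-memory block rather than getting per-process copies
    SHARED_CHUNKS = sharedctypes.RawArray(ctypes.c_char, 30 * 8192)

    def read_chunk(i):
        # slicing a c_char array returns the raw bytes for chunk i
        return SHARED_CHUNKS[i * 8192:(i + 1) * 8192]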

Meanwhile, if you don't actually need all of the cache data all the time, just bits and pieces every once in a while, that's exactly what databases are for. Again, anydbm, etc. may be all you need, but if you've got complex structure, draw up an ER diagram, turn it into a set of tables, and use something like MySQL.

You said nothing about writing this data back. Is it static? In this case, the solution is very simple, and I have no clue what is up with all the "it's not feasible" responses.

uWSGI workers are always-running applications. So data absolutely gets persisted between requests. All you need to do is store stuff in a global variable; that's it. And remember it's per-worker, and workers do restart from time to time, so you need proper loading/invalidation strategies.
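
A minimal sketch of that pattern; the 5-minute TTL and the load_all_chunks loader are assumptions:

    import time

    _CACHE = None
    _LOADED_AT = 0.0
    _TTL = 300  # refresh the "static" data every 5 minutes (assumed)

    def get_global_data():
        global _CACHE, _LOADED_AT
        # each uWSGI worker keeps its own copy, reloaded lazily once it goes stale
        if _CACHE is None or time.time() - _LOADED_AT > _TTL:
            _CACHE = load_all_chunks()  # hypothetical: returns the ~30 deserialized chunks
            _LOADED_AT = time.time()
        return _CACHE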

If the data is updated very rarely (rarely enough to restart the server when it does), you can save even more. Just create the objects during app construction. This way, they will be created exactly once, and then all the workers will fork off the master and reuse the same data. Of course, it's copy-on-write, so if you update it, you will lose the memory benefits (the same thing will happen if Python decides to compact its memory during a GC run, so it's not super predictable).
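
A sketch of that fork-and-share variant; it relies on uWSGI loading the app in the master before forking (the default, i.e. lazy-apps not enabled), and the file path is invented:

    import json

    # runs exactly once, in the uWSGI master, while the app module is imported;
    # every forked worker then shares these pages copy-on-write
    with open('/etc/myapp/static_chunks.json') as f:
        GLOBAL_DATA = json.load(f)  # treat as read-only, or copy-on-write kicks in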

I've never tried it personally, but could you use uWSGI's SharedArea to accomplish what you're trying to do?

"python in-memory data will NOT be persisted across all requests to my knowledge, unless I'm terribly mistaken." “就我所知,Python内存中的数据不会在所有请求中都保留下来,除非我犯了一个非常严重的错误。”

you are mistaken.

the whole point of using uwsgi over, say, the CGI mechanism is to persist data across threads and save the overhead of initialization for each call. you must set processes = 1 in your .ini file, or, depending on how uwsgi is configured, it might launch more than 1 worker process on your behalf. log the env and look for 'wsgi.multiprocess': False and 'wsgi.multithread': True, and then all uwsgi.core threads for the single worker should show the same data.
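
for instance, an ini along these lines (module path assumed) pins everything to one multi-threaded worker:

    [uwsgi]
    # assumed entry point
    module = myapp.wsgi:application
    master = true
    # exactly one worker process, so exactly one copy of your globals
    processes = 1
    # threads within that worker all share its memory
    threads = 8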

you can also see how many worker processes, and "core" threads under each, you have by using the built-in stats-server.

that's why uwsgi provides lock and unlock functions for manipulating data stores by multiple threads.
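
a sketch of that; STORE here is a stand-in for whatever shared structure your threads mutate:

    import uwsgi  # only available inside a running uWSGI worker

    STORE = {}  # stand-in for a structure shared between this worker's threads

    def update_store(key, value):
        uwsgi.lock()        # take uWSGI's default lock (lock 0)
        try:
            STORE[key] = value
        finally:
            uwsgi.unlock()  # always release, even if the update raises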

you can easily test this by adding a /status route in your app that just dumps a json representation of your global data object, and view it every so often after actions that update the store.
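
a sketch of such a route as plain WSGI; GLOBAL_DATA stands in for whatever module-level object holds your cached data:

    import json

    GLOBAL_DATA = {}  # stand-in for the per-worker cache being inspected

    def application(environ, start_response):
        if environ.get('PATH_INFO') == '/status':
            body = json.dumps(GLOBAL_DATA).encode('utf-8')
            start_response('200 OK', [('Content-Type', 'application/json')])
            return [body]
        start_response('404 Not Found', [('Content-Type', 'text/plain')])
        return [b'not found']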
