

Python: Looking for recommendations for a webserver-setup acting as a database-cache

I am looking for a lightweight webserver-setup acting as a REST-API cache for multiple external databases.

Fixed requirements for the project: use Python3 on CentOS7

Guideline: Create a lightweight webserver, which needs to be robust and fast

Use-case scenario: During service start I need to cache data from 10 external database servers in RAM. Each server has 5 tables with ca. 100k rows each, so in sum I need to merge the data into 5 tables with ca. 1 million entries. Every 10 minutes I need to query the servers again to identify new/removed rows and update the cache. The webserver will receive requests to look up a single entry from the cache, filtered by table and a given search condition (like "field_1" = "value_X").

Expected web-load: avg. 1 request/sec., with a (rare) peak load of ca. 100 requests/sec.

Now my questions about the above scenario:

  1. I can get the data from the DB-servers as JSON, XML or CSV. Which format is the recommended one for this use case (fast "inserts" into a table with 1 mio. rows)?
  2. How should I store the data in memory? pandas DataFrames?
  3. In sum, what is the recommended framework for all this? pandas, gunicorn, supervisor & nginx?

Many thanks for any input.

To deserialize your data, CSV will be the fastest method in most cases. It allows you to read multiple lines in different threads and reduces complexity.
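As a minimal sketch of that approach (the export URL scheme, the fetch_csv() helper, and the column layout are assumptions, not part of the question):

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch_csv(server_url):
    """Fetch one DB server's raw CSV export over HTTP (URL scheme is an assumption)."""
    with urlopen(server_url) as resp:
        return resp.read().decode("utf-8")

def parse_csv(payload):
    """Turn a CSV payload (header row + data rows) into a list of row tuples."""
    reader = csv.reader(io.StringIO(payload))
    next(reader)  # skip the header row
    return [tuple(row) for row in reader]

def load_all(server_urls):
    """Fetch and parse all servers concurrently, then merge the rows."""
    with ThreadPoolExecutor(max_workers=len(server_urls)) as pool:
        merged = []
        for payload in pool.map(fetch_csv, server_urls):
            merged.extend(parse_csv(payload))
    return merged
```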

To store the data, I would recommend either the most performant solution, which is likely an existing (No)SQL database implementation, or the programmatically easier way of using an SQLite in-memory database. Pandas is better suited for analysis, while I understand that you want functionality similar to a normal DBMS to just fetch data. SQLite is faster (and easier) than Pandas for those use cases.
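For illustration, a sketch of such an SQLite in-memory cache for one of the five tables; the table and column names are made up, and `check_same_thread=False` is what allows the webserver's worker threads to share the connection (the default SQLite build serializes access to it):

```python
import sqlite3

conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE table_1 (field_1 TEXT, field_2 TEXT)")
# Index the lookup column so single-entry queries stay fast at ~1 mio. rows.
conn.execute("CREATE INDEX idx_field_1 ON table_1 (field_1)")

def refresh_cache(rows):
    """Replace the cached rows (at service start and every 10 minutes)."""
    with conn:  # one transaction; executemany bulk-inserts the merged rows
        conn.execute("DELETE FROM table_1")
        conn.executemany("INSERT INTO table_1 VALUES (?, ?)", rows)

def lookup(value):
    """Single-entry lookup, e.g. field_1 = "value_X"."""
    return conn.execute(
        "SELECT field_1, field_2 FROM table_1 WHERE field_1 = ?", (value,)
    ).fetchone()
```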

In your use case, I would recommend using the FastAPI library to serve the API automatically in multiple threads within Python. You do not need another webserver in front of it, unless you want to do caching there. The script can access the in-memory database or a dedicated DBMS application from within those threads. Whether you need supervisor depends on your use case: inside a container, or if the script runs as a service, it will not be needed.
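A minimal sketch of that setup, reusing the `load_all()`, `refresh_cache()` and `lookup()` helpers from the two sketches above (the route, the parameter name and the server list are placeholders):

```python
import threading

from fastapi import FastAPI, HTTPException

SERVER_URLS = ["http://db-server-1.example/export.csv"]  # placeholder list

app = FastAPI()

@app.on_event("startup")
def start_refresh_loop():
    """Fill the cache at service start, then re-sync every 10 minutes."""
    def tick():
        refresh_cache(load_all(SERVER_URLS))
        threading.Timer(600, tick).start()  # schedule the next refresh
    tick()

@app.get("/table_1")
def get_entry(field_1: str):
    """Look up one cached entry, e.g. GET /table_1?field_1=value_X."""
    row = lookup(field_1)  # plain def endpoints run in FastAPI's threadpool
    if row is None:
        raise HTTPException(status_code=404, detail="no matching entry")
    return {"field_1": row[0], "field_2": row[1]}
```

Started with e.g. `uvicorn main:app`, a single process is plenty for an avg. of 1 request/sec. and keeps the whole cache in one place; the threadpool behind the plain `def` endpoint absorbs the 100 requests/sec. peaks against the indexed in-memory table.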
