
What is the best way to search millions of JSON files?

I've very recently picked up programming in Python and am working on creating a database.

I've already extracted all of these files from their source, so they are all in a directory on my computer.

All of these files are structured the same way and what I want to do is search these multidimensional dictionaries and locate the value for a specific set of keys.

These JSON files are all structured similarly:

{
    "userid": 34535367,
    "result": {
        "list": [
            {
                "name": 264,
                "age": 64,
                "id": 456345345
            },
            {
                "name": 263,
                "age": 42,
                "id": 364563463456
            }
        ]
    }
}

In my case, I would like to search for the "name" key and return the relevant data (quality, id and the original userid) for the thousands of names just like it from my millions of JSON files.

Basically I'm very new at this and the little programming knowledge I have is in Python. I'm happy to start learning whatever I need to, but I'm not sure which direction to go.

If your goal is to create a database, then you should look at how databases work and how they solve the same problem you are trying to solve right now :)

NoSQL databases (like MongoDB) also work with JSON documents and most likely implement a whole set of tools to search and filter documents.
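A minimal sketch of that approach, assuming MongoDB is running locally and the pymongo driver is installed; the database and collection names ("mydb", "users") and the data/ directory are placeholders:

import glob
import json
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
collection = client["mydb"]["users"]

# Load every JSON file into the collection once
for path in glob.glob("data/*.json"):
    with open(path) as f:
        doc = json.load(f)
    doc["source_file"] = path  # remember where the document came from
    collection.insert_one(doc)

# Find every document whose nested list contains an entry with name == 264
for doc in collection.find({"result.list.name": 264}):
    print(doc["userid"], doc["source_file"])

The "result.list.name" query path matches any element of the nested list, so each match returns the whole document, including the original userid.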

Now to answer your question, there is no quick way to do so unless you do some preprocessing, meaning that you store different information about the data (called metadata). This is a huge subject and I don't have enough expertise to give you all the answers, but I can give you a simple tip: Use indexes.

An index is a sorted key/value map where, for every value, we store the documents that contain that value (or the file + position of the JSON document). For example, an index for the name property would look like this:

{
    263: ('jsonfile10.json', '0'),
    264: ('jsonfile10.json', '30'),
    # the JSON document for 264 can be found in the jsonfile10.json file at line 30
}

By keeping an index for the most queried values, you can turn a linear time search into a logarithmic time search, not to mention that inserting a new document is much faster. In your case, you seem to only need an index on the name field.
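A minimal sketch of that idea in plain Python, assuming the files live in a data/ directory; it uses an in-memory dict (a hash index rather than a sorted map), which is enough to show the principle:

import glob
import json
from collections import defaultdict

name_index = defaultdict(list)  # name -> list of (file, userid, entry)

# Build the index once by scanning every file
for path in glob.glob("data/*.json"):
    with open(path) as f:
        doc = json.load(f)
    for entry in doc["result"]["list"]:
        name_index[entry["name"]].append((path, doc["userid"], entry))

# A lookup is now a single dictionary access instead of a scan over millions of files
for path, userid, entry in name_index.get(264, []):
    print(path, userid, entry["id"], entry["age"])

In practice you would persist the index (for example with pickle or as its own JSON file) so it does not have to be rebuilt on every run.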

Creating/updating the index is done when you insert, update or remove a document. Using a balanced binary tree can speed up updates to the index.

As a suggestion, why don't you just process all the incoming files and insert the data into a database? You will have a toolset to query that database. SQLite, for example, will do (as well as any other more sophisticated database): http://www.sqlite.org/ http://docs.python.org/2/library/sqlite3.html
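A minimal sketch of that suggestion with the standard sqlite3 module, assuming the JSON files sit in a data/ directory; the table and column names are illustrative:

import glob
import json
import sqlite3

conn = sqlite3.connect("names.db")
conn.execute("""CREATE TABLE IF NOT EXISTS entries
                (userid INTEGER, name INTEGER, age INTEGER, id INTEGER, source TEXT)""")

# Flatten each JSON file into rows
for path in glob.glob("data/*.json"):
    with open(path) as f:
        doc = json.load(f)
    conn.executemany(
        "INSERT INTO entries VALUES (?, ?, ?, ?, ?)",
        [(doc["userid"], e["name"], e["age"], e["id"], path)
         for e in doc["result"]["list"]])
conn.commit()

# An index on the name column keeps lookups fast even with millions of rows
conn.execute("CREATE INDEX IF NOT EXISTS idx_name ON entries(name)")
for row in conn.execute(
        "SELECT userid, age, id, source FROM entries WHERE name = ?", (264,)):
    print(row)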

Another simple solution might be to build a file mapping name_id to /file/path. Then you can do a logarithmic binary search by the name id. But I'd still advise using a proper database, as maintaining the index yourself will be more cumbersome than doing some inserts/deletes.
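A minimal sketch of that mapping idea, assuming the same data/ directory; it keeps a sorted list of (name, file) pairs and binary-searches it with the standard bisect module:

import glob
import json
from bisect import bisect_left

# Build the sorted mapping once
mapping = []
for path in glob.glob("data/*.json"):
    with open(path) as f:
        doc = json.load(f)
    for entry in doc["result"]["list"]:
        mapping.append((entry["name"], path))
mapping.sort()

def files_for_name(name):
    """Binary search for every file that contains the given name."""
    i = bisect_left(mapping, (name,))
    paths = []
    while i < len(mapping) and mapping[i][0] == name:
        paths.append(mapping[i][1])
        i += 1
    return paths

print(files_for_name(264))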
