

How to improve retrieval query performance in ArangoDB 2.7

I am a beginner in Python and ArangoDB. I have stored the data in ArangoDB in a single collection named "DSP". My query is:

for k in 
    (for t in DSP return [t.data])
        for z in k
           for p in z
              filter p.name == "name" || 
                     p.content == "pdf" ||
                     p.content == "xml" ||
                     p.name == "Book"
              return p

and the JSON data I have stored is in the following format:

{"data": [{"content": "Java", "type": "string", "name": "name", "key": 1}, {"content": "D:/Java", "type": "string", "name": "location", "key": 1}, {"content": "File folder", "type": "string", "name": "type", "key": 1}, {"content": 1896038645, "type": "int", "name": "size", "key": 1}, {"content": 7, "type": "string", "name": "child_folder_count", "key": 1}, {"content": 7, "type": "string", "name": "child_file_count", "key": 1}, {"content": "parse_dir.py", "type": "string", "name": "name", "key": 101}, {"content": "D:/Java/parse_dir.py", "type": "string", "name": "location", "key": 101}, {"content": "py", "type": "string", "name": "mime-type", "key": 101}, {"content": 4032, "type": "string", "name": "size", "key": 101}, {"content": "Wed Dec 30 21:36:32 2015", "type": "string", "name": "created_date", "key": 101}, {"content": "Wed Dec 30 21:42:38 2015", "type": "string", "name": "modified_date", "key": 101}, {"content": "result.json", "type": "string", "name": "name", "key": 102}, {"content": "D:/Java/result.json", "type": "string", "name": "location", "key": 102}, {"content": "json", "type": "string", "name": "mime-type", "key": 102}, {"content": 1134450, "type": "string", "name": "size", "key": 102}, {"content": "Wed Dec 30 21:36:45 2015", "type": "string", "name": "created_date", "key": 102}, {"content": "Wed Dec 30 21:36:45 2015", "type": "string", "name": "modified_date", "key": 102}, {"content": "rmi1.rar", "type": "string", "name": "name", "key": 103}, {"content": "D:/Java/rmi1.rar", "type": "string", "name": "location", "key": 103}, {"content": "rar", "type": "string", "name": "mime-type", "key": 103}, {"content": 165116, "type": "string", "name": "size", "key": 103}, {"content": "Sun Aug 25 07:29:52 2013", "type": "string", "name": "created_date", "key": 103}, {"content": "Tue Aug 30 16:18:34 2011", "type": "string", "name": "modified_date", "key": 103}, {"content": "servlet.rar", "type": "string", "name": "name", "key": 104}, {"content": 
"D:/Java/servlet.rar", "type": "string", "name": "location", "key": 104}, {"content": "rar", "type": "string", "name": "mime-type", "key": 104}, {"content": 782, "type": "string", "name": "size", "key": 104}, {"content": "Sun Aug 25 07:29:52 2013", "type": "string", "name": "created_date", "key": 104}, {"content": "Tue Aug 30 16:18:30 2011", "type": "string", "name": "modified_date", "key": 104}, {"content": "crawler projects", "type": "string", "name": "name", "key": 2}, {"content": "D:/Java/crawler projects", "type": "string", "name": "location", "key": 2}, {"content": "File folder", "type": "string", "name": "type", "key": 2}, {"content": 1886842316, "type": "int", "name": "size", "key": 2}, {"content": 5, "type": "string", "name": "child_folder_count", "key": 2}, {"content": 5, "type": "string", "name": "child_file_count", "key": 2}, {"content": ".metadata", "type": "string", "name": "name", "key": 3}, {"content": "D:/Java/crawler projects/.metadata", "type": "string", "name": "location", "key": 3}, {"content": "File folder", "type": "string", "name": "type", "key": 3}, {"content": 10131546, "type": "int", "name": "size", "key": 3}, {"content": 2, "type": "string", "name": "child_folder_count", "key": 3}, {"content": 2, "type": "string", "name": "child_file_count", "key": 3}, {"content": ".lock", "type": "string", "name": "name", "key": 301}, {"content": "D:/Java/crawler projects/.metadata/.lock", "type": "string", "name": "location", "key": 301}, {"content": "", "type": "string", "name": "mime-type", "key": 301}, {"content": 0, "type": "string", "name": "size", "key": 301}, {"content": "Sun Aug 25 07:29:52 2013", "type": "string", "name": "created_date", "key": 301}, {"content": "Mon May 30 12:21:45 2011", "type": "string", "name": "modified_date", "key": 301}, {"content": ".log", "type": "string", "name": "name", "key": 302}, {"content": "D:/Java/crawler projects/.metadata/.log", "type": "string", "name": "location", "key": 302}, {"content": "", "type": 
"string", "name": "mime-type", "key": 302}, {"content": 598, "type": "string", "name": "size", "key": 302}, {"content": "Sun Aug 25 07:29:52 2013", "type": "string", "name": "created_date", "key": 302}, {"content": "Mon May 30 15:29:18 2011", "type": "string", "name": "modified_date", "key": 302}, {"content": "version.ini", "type": "string", "name": "name", "key": 303}, {"content": "D:/Java/crawler projects/.metadata/version.ini", "type": "string", "name": "location", "key": 303}, {"content": "ini", "type": "string", "name": "mime-type", "key": 303}, {"content": 26, "type": "string", "name": "size", "key": 303}, {"content": "Sun Aug 25 07:29:52 2013", "type": "string", "name": "created_date", "key": 303}, {"content": "Mon May 30 15:29:18 2011", "type": "string", "name": "modified_date", "key": 303}, {"content": ".mylyn", "type": "string", "name": "name", "key": 4}, {"content": "D:/Java/crawler projects/.metadata/.mylyn", "type": "string", "name": "location", "key": 4}, {"content": "File folder", "type": "string", "name": "type", "key": 4}, {"content": 920, "type": "int", "name": "size", "key": 4}, {"content": 1, "type": "string", "name": "child_folder_count", "key": 4}, {"content": 1, "type": "string", "name": "child_file_count", "key": 4}, {"content": ".tasks.xml.zip", "type": "string", "name": "name", "key": 401}, {"content": "D:/Java/crawler projects/.metadata/.mylyn/.tasks.xml.zip", "type": "string", "name": "location", "key": 401}, {"content": "zip", "type": "string", "name": "mime-type", "key": 401}, {"content": 250, "type": "string", "name": "size", "key": 401}, {"content": "Sun Aug 25 07:29:52 2013", "type": "string", "name": "created_date", "key": 401}, {"content": "Mon May 30 12:23:18 2011", "type": "string", "name": "modified_date", "key": 401}, {"content": "repositories.xml.zip", "type": "string", "name": "name", "key": 402}, {"content": "D:/Java/crawler projects/.metadata/.mylyn/repositories.xml.zip", "type": "string", "name": "location", "key": 402}, 
{"content": "zip", "type": "string", "name": "mime-type", "key": 402}, {"content": 420, "type": "string", "name": "size", "key": 402}, {"content": "Sun Aug 25 07:29:52 2013", "type": "string", "name": "created_date", "key": 402}, {"content": "Mon May 30 12:23:18 2011", "type": "string", "name": "modified_date", "key": 402}, {"content": "tasks.xml.zip", "type": "string", "name": "name", "key": 403}, {"content": "D:/Java/crawler projects/.metadata/.mylyn/tasks.xml.zip", "type": "string", "name": "location", "key": 403}, {"content": "zip", "type": "string", "name": "mime-type", "key": 403}, {"content": 250, "type": "string", "name": "size", "key": 403}, {"content": "Sun Aug 25 07:29:52 2013", "type": "string", "name": "created_date", "key": 403}, {"content": "Mon May 30 15:31:16 2011", "type": "string", "name": "modified_date", "key": 403}, {"content": "contexts", "type": "string", "name": "name", "key": 5}, {"content": "D:/Java/crawler projects/.metadata/.mylyn/contexts", "type": "string", "name": "location", "key": 5}, {"content": "File folder", "type": "string", "name": "type", "key": 5}, {"content": 0, "type": "int", "name": "size", "key": 5}, {"content": 0, "type": "string", "name": "child_folder_count", "key": 5}]

As I add more JSON documents (approximately 100 documents of about 15 MB each) or add more and more filter conditions, the query takes more than 1 minute, and sometimes the browser stops responding.

I am running this experiment on an Intel Core i3 2.4 GHz with 4 GB RAM and a 160 GB SATA hard drive.

Kindly tell me: first, how can I improve the performance of the query? Do I need to change my storage structure or the syntax of my query? And how can I perform join operations on the multiple documents that share the same key, for example, "retrieve the name of documents of type xml"?

There should be a few ways to improve this query's performance:

  • Selecting all documents from collection DSP via a subquery and then iterating over them ( for k in (for t in DSP return [t.data]) for z in k for p in z filter p.name == "name" ... ) may be less efficient than using the documents directly. Try replacing the subquery and the 4 FOR loops with just FOR k IN DSP FOR p IN k.data FILTER p.name == "name" ...

  • If you look at the query's explain output, it will show that no index will be used. If you have lots of documents in the collection and only want to retrieve a few of them with a query, an index will help performance-wise. I suggest using an array index on data[*].name and one on data[*].content. You can set them up like this: db.DSP.ensureIndex({ type: "hash", fields: [ "data[*].name" ] }); db.DSP.ensureIndex({ type: "hash", fields: [ "data[*].content" ] }); Note: these types of indexes require ArangoDB 2.8. With these indexes, the query can also be simplified to: FOR p in DSP FILTER "name" IN p.data[*].name || "Book" IN p.data[*].name || "pdf" IN p.data[*].content ... Note that indexes will only help you quickly find the documents containing the search data, but not the parts of each document that contain it.

  • It may be helpful to adjust the document structure. Your current structure seems to contain multiple content and name values per document, e.g. [ {"content": "Java", "type": "string", "name": "name", "key": 1}, {"content": "D:/Java", "type": "string", "name": "location", "key": 1} ]. It looks like each document has only a data attribute, which is an array of these structures. Instead of using this structure, you may try saving each array value as a separate document. For example, {"content": "Java", "type": "string", "name": "name", "key": 1} would become a document of its own, {"content": "D:/Java", "type": "string", "name": "location", "key": 1} would become another document, etc. This seems sensible, as your sub-structures already have a key attribute, and several array values seem to refer to the same key value. The transformation will allow splitting the potentially very big documents into much smaller chunks. This will not only make the AQL run quicker (as it will need to unpack far less data when accessing a document), but will also let you get rid of all the nested loops and of locating the relevant inner array values when returning the result.
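To make the [*] expansion in the second bullet concrete: a filter like "pdf" IN p.data[*].content matches a document whenever any element of its data array has that content value. A minimal stand-alone sketch of that matching logic in plain Python (the collection is simulated with a list; the helper name is illustrative, not part of any ArangoDB driver):

```python
# Simulates what the AQL filter
#   FILTER "name" IN p.data[*].name || "Book" IN p.data[*].name
#       || "pdf" IN p.data[*].content || "xml" IN p.data[*].content
# matches: any single element of the document's data array may satisfy it.

def matches(doc):
    names = [e.get("name") for e in doc.get("data", [])]
    contents = [e.get("content") for e in doc.get("data", [])]
    return ("name" in names or "Book" in names
            or "pdf" in contents or "xml" in contents)

docs = [
    {"data": [{"content": "Java", "type": "string", "name": "name", "key": 1}]},
    {"data": [{"content": "D:/Java", "type": "string", "name": "location", "key": 1}]},
]

print([matches(d) for d in docs])  # prints [True, False]
```

The array index lets ArangoDB answer this membership test without scanning every document, but as noted above it does not tell you which array elements matched.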

Should you adjust the document structure as suggested, your query can then be greatly simplified to just FOR p IN DSP FILTER p.name == "name" || p.name == "Book" || p.content == "pdf" || p.content == "xml" RETURN p, and it should be fast if indexes are used.
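The restructuring suggested in the last bullet can be sketched as a plain transformation, independent of whichever driver is then used to re-insert the results. The helper names below are illustrative, not part of ArangoDB's API; the merge variant additionally groups elements that share a key, since those appear to describe the same file or folder:

```python
# Split one big {"data": [...]} document into one document per array element.

def split_document(big_doc):
    """Return each element of big_doc["data"] as a stand-alone document."""
    return list(big_doc.get("data", []))

def merge_by_key(big_doc):
    """Merge elements sharing the same 'key' into one document per file/folder."""
    merged = {}
    for e in big_doc.get("data", []):
        d = merged.setdefault(e["key"], {"key": e["key"]})
        d[e["name"]] = e["content"]   # e.g. name, location, mime-type, size
    return list(merged.values())

big = {"data": [
    {"content": "Java", "type": "string", "name": "name", "key": 1},
    {"content": "D:/Java", "type": "string", "name": "location", "key": 1},
]}

print(merge_by_key(big))
# prints [{'key': 1, 'name': 'Java', 'location': 'D:/Java'}]
```

The merged form would also answer the join-style question from the original post ("retrieve the name of documents of type xml") with a single filter on one document, e.g. FILTER p["mime-type"] == "xml" RETURN p.name, instead of correlating several array elements by key.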


 