Speeding up a SPARQL query on GraphDB

I'm trying to speed up and optimize this query:

select distinct ?root where { 
    ?root a :Root ;
          :hasnode* ?node ;
          :hasnode* ?node2 .

    ?node a :Node ;
           :hasAnnotation ?ann .
    ?ann :hasReference ?ref .
    ?ref a :ReferenceType1 .

    ?node2 a :Node ;
            :hasAnnotation ?ann2 .
    ?ann2 :hasReference ?ref2 .
    ?ref2 a :ReferenceType2 .

}

Basically, I'm analyzing some trees and I want to get all trees (i.e., their roots) that have at least a pair of underlying nodes, each matching a pattern like this one:

?node_x a :Node ;
       :hasAnnotation ?ann_x .
?ann_x :hasReference ?ref_x .
?ref_x a :ReferenceTypex .

one with x = 1 and the other with x = 2.

Since in my graph a node may have at most one :hasAnnotation predicate, I do not have to specify that those nodes must be different.

The problem

The query above describes what I need, but it performs very badly: after several minutes of execution, it is still running.

My (ugly) solution: breaking it in half

I noticed that if I look for one node pattern at a time, I get my result in a few seconds(!).

Sadly, my current approach consists of running the following query type twice:

select distinct ?root where { 
    ?root a :Root ;
          :hasnode* ?node .

    ?node a :Node ;
           :hasAnnotation ?ann_x .
    ?ann_x :hasReference ?ref_x .
    ?ref_x a :ReferenceTypex .
}

one with x = 1 and the other with x = 2.

I save the partial results (i.e., the ?root values) in two sets, say R1 and R2, and finally compute the intersection of those result sets.

Is there a way to speed up my initial approach and get the results using SPARQL alone?

PS: I'm working with GraphDB.

Well, putting together auto-hint :) and Stanislav's suggestion, I came up with a solution.

Solution 1: nested query

Nesting the query in the following way, I get the result in 15s.

select distinct ?root where { 
    ?root a :Root ;
          :hasnode* ?node .
    ?node a :Node ;
          :hasAnnotation ?ann .
    ?ann :hasReference ?ref .
    ?ref a :ReferenceType1 .
    {
        select distinct ?root where { 
            ?root a :Root ;
                  :hasnode* ?node2 .
            ?node2 a :Node ;
                   :hasAnnotation ?ann2 .
            ?ann2 :hasReference ?ref2 .
            ?ref2 a :ReferenceType2 .
        }
    }
}

Solution 2: groups into {}

Grouping the parts into {}, as suggested by Stanislav, took 60s.

select distinct ?root where { 
    {
        ?root a :Root ;
              :hasnode* ?node .

        ?node a :Node ;
              :hasAnnotation ?ann .
        ?ann :hasReference ?ref .
        ?ref a :ReferenceType1 .
    }
    {
        ?root a :Root ;
              :hasnode* ?node2 .

        ?node2 a :Node ;
               :hasAnnotation ?ann2 .
        ?ann2 :hasReference ?ref2 .
        ?ref2 a :ReferenceType2 .
    }
}

Probably GraphDB's optimizer builds a query plan that is more effective for my data in the first case (explanations are welcome).
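One way to look into that is GraphDB's explain-plan feature which, as far as I understand its documentation, is switched on by adding the special onto:explain graph to the FROM clause, so the query returns the optimizer's plan instead of results. A minimal sketch for one half of the query, assuming the onto: prefix shown below and a placeholder namespace for the default prefix (neither appears in the original post):

```sparql
PREFIX onto: <http://www.ontotext.com/>
PREFIX : <http://example.org/>   # placeholder namespace, an assumption

# With onto:explain in FROM, the plan is returned rather than the bindings.
select distinct ?root
from onto:explain
where {
    ?root a :Root ;
          :hasnode* ?node .
    ?node a :Node ;
          :hasAnnotation ?ann .
    ?ann :hasReference ?ref .
    ?ref a :ReferenceType1 .
}
```

Running the nested and the grouped variants this way and comparing the two plans should show where their join orders differ.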

I've always thought about SPARQL in a 'declarative' way, but it seems there is massive variability in performance depending on how you write your SPARQL. Coming from SQL, it seems to me that this variability is much greater than in the relational world.

However, reading this post, it seems I'm not sufficiently aware of SPARQL optimizer dynamics. :)

Without knowing the specific dataset, I can only give you some general directions on how to optimize the query:

Avoid using DISTINCT for large datasets

The GraphDB query optimiser will not automatically rewrite the query to use EXISTS for all patterns that do not participate in the projection. The semantics you want is to check that at least one such pattern exists, not to produce all bindings and then eliminate the duplicate results.
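As a sketch of what such a rewrite could look like for the query in the question (keeping the :hasnode* path, and assuming a placeholder namespace for the default prefix), only ?root is enumerated and each node pattern becomes an existence test. Each root is produced once by the first triple pattern, so DISTINCT is not needed; whether this is faster on a given dataset would have to be measured:

```sparql
PREFIX : <http://example.org/>   # placeholder namespace, an assumption

select ?root where {
    ?root a :Root .

    # At least one reachable node whose annotation references a ReferenceType1.
    FILTER EXISTS {
        ?root :hasnode* ?node .
        ?node a :Node ;
              :hasAnnotation ?ann .
        ?ann :hasReference ?ref .
        ?ref a :ReferenceType1 .
    }

    # At least one reachable node whose annotation references a ReferenceType2.
    FILTER EXISTS {
        ?root :hasnode* ?node2 .
        ?node2 a :Node ;
               :hasAnnotation ?ann2 .
        ?ann2 :hasReference ?ref2 .
        ?ref2 a :ReferenceType2 .
    }
}
```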

Materialize the property paths

GraphDB has a very efficient forward-chaining reasoner and a comparatively less optimised property path expansion. If you are not concerned about write/data update performance, I suggest you declare :hasnode as a transitive property (see owl:TransitiveProperty in query), which will eliminate the property path wildcard. This will speed up the query many times over.
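A minimal sketch of that declaration as a SPARQL update, assuming a placeholder namespace for the default prefix and a repository ruleset that includes owl:TransitiveProperty inference (a repository without inference would not materialize anything):

```sparql
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX : <http://example.org/>   # placeholder namespace, an assumption

# Ask the reasoner to materialize the transitive closure of the property.
INSERT DATA {
    :hasnode a owl:TransitiveProperty .
}
```

Note that :hasnode* also matches the zero-length path (the root itself), whereas the materialized :hasnode only covers one or more steps. That makes no difference here as long as a root is never itself one of the annotated :Node resources.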

Your final query should look like:

select ?root where { 
    ?root a :Root ;
          :hasnode ?node ;
          :hasnode ?node2 .

    FILTER (?node != ?node2)

    FILTER EXISTS {
        ?node a :Node ;
               :hasAnnotation ?ann .
        ?ann :hasReference ?ref .
        ?ref a :ReferenceType1 .
    }

    FILTER EXISTS {
        ?node2 a :Node ;
                :hasAnnotation ?ann2 .
        ?ann2 :hasReference ?ref2 .
        ?ref2 a :ReferenceType2 .
    }
}
