
SPARQL query to return the number of neighbors

I need to find just the number of neighbors (up to 4 nodes away) of a given article in DBpedia (two articles are neighbors when there is a wikilink between them). Currently I'm running this query, but it takes a long time to compute:

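# dbo: is predefined on the public DBpedia endpoint but must be declared elsewhere
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT (COUNT(?n4) AS ?count)
WHERE {
    SELECT DISTINCT ?n4
    WHERE {
        <http://dbpedia.org/resource/Albert_Einstein> dbo:wikiPageWikiLink/dbo:wikiPageWikiLink/dbo:wikiPageWikiLink/dbo:wikiPageWikiLink ?n4 .
    }
}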

Does anyone have an idea for a better way to do this? I only need the number of neighbors. The query is only fast up to degree 2; at degree 3 it takes almost 30 seconds to complete, and at degree 4 it almost always times out.

I'm using RDFLib and Python to run the query, so any Python tricks would also be helpful!

EDIT: I have already downloaded the dataset and set up a local endpoint for the query, but performance is still poor.

If you are going to run many repeated queries for neighbors that are 4 steps away, you could put all the computational effort into a single, one-time calculation that materializes an equivalent property:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX ex: <http://example.com/>

CONSTRUCT {
  ?x ex:fourthNeighbour ?y .
}
WHERE {
  ?x dbo:wikiPageWikiLink/dbo:wikiPageWikiLink/dbo:wikiPageWikiLink/dbo:wikiPageWikiLink ?y .
}

This will still take a long time to run; however, you only need to do it once. Provided the constructed triples are loaded back into the store, any later query for 4-step neighbours will be much faster.
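The same precompute-once idea can be sketched in plain Python, outside any triple store. This is only an illustration under assumptions: the `links` dictionary is a hypothetical toy graph standing in for `dbo:wikiPageWikiLink`, and `compose` and `materialize_fourth_neighbours` are names made up here. It builds the 4-step relation by composing the 2-step relation with itself, after which each count is a dictionary lookup.

```python
from collections import defaultdict


def compose(first, second):
    """Return the relation 'follow one link in first, then one in second'."""
    result = defaultdict(set)
    for source, middles in first.items():
        for middle in middles:
            result[source].update(second.get(middle, ()))
    return dict(result)


def materialize_fourth_neighbours(links):
    """Precompute, for every node, the nodes reachable by exactly 4 links."""
    two_step = compose(links, links)
    return compose(two_step, two_step)


# Hypothetical toy link graph (node -> set of nodes it links to).
links = {
    "A": {"B"},
    "B": {"C"},
    "C": {"D"},
    "D": {"E", "F"},
    "E": {"A"},
    "F": set(),
}

# One-time, expensive step.
fourth = materialize_fourth_neighbours(links)

# Every later count is a cheap lookup.
count_a = len(fourth.get("A", ()))
print(count_a)  # 2, because A -> B -> C -> D -> {E, F}
```

The trade-off is the same as with the CONSTRUCT query: the table can be very large, so this only pays off when the counts are requested many times.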

SPARQL 1.1 property paths can have very high time and space complexity; see the paper "Counting Beyond a Yottabyte, or how SPARQL 1.1 Property Paths will Prevent Adoption of the Standard".

Your query has a worst-case complexity of O(n^4), where n is the number of articles in DBpedia, which is large. The actual runtime depends on the network structure of the data. Imagine John has 100 friends and each of them in turn has 100 friends, and so on: the friends of degree 4 can then number up to (counting duplicates) 100^4 = 10^8 = 100 million.
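To make the arithmetic concrete, here is a small sketch on an assumed toy network (not DBpedia data): 101 people who are all friends with each other, so everyone has exactly 100 friends. The helper `walk_counts` is a hypothetical name; it counts 4-step walks per end node without enumerating them one by one.

```python
from collections import Counter


def walk_counts(links, start, steps):
    """Count walks of exactly `steps` links from `start`, grouped by end node."""
    counts = Counter({start: 1})
    for _ in range(steps):
        next_counts = Counter()
        for node, ways in counts.items():
            for neighbour in links.get(node, ()):
                next_counts[neighbour] += ways
        counts = next_counts
    return counts


# Assumed toy network: 101 people, each linked to the 100 others.
people = [f"p{i}" for i in range(101)]
links = {p: {q for q in people if q != p} for p in people}

counts = walk_counts(links, "p0", 4)
total_walks = sum(counts.values())
distinct_ends = len(counts)

print(total_walks)    # 100000000 walks, i.e. 100^4 results before DISTINCT
print(distinct_ends)  # 101 distinct end nodes
```

An engine that enumerates every walk before applying DISTINCT handles the first number; the answer the question asks for is the second.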

Additionally, in my testing RDFLib performs very poorly compared to a dedicated triple store such as Virtuoso Opensource 7.

However, if even that is not enough, you could try dedicated graph theory tools and libraries such as NetworkX, Gephi and Cytoscape. While RDF is also a graph data model, triple stores may not be optimized for this kind of query.
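As a rough idea of what such a graph-oriented approach does, here is a standard-library sketch of a breadth-first search with a depth cutoff. It assumes the wikilinks have already been exported into an adjacency dictionary (the `links` data below is made up) and that links are followed in one direction only. Note that it counts nodes within 4 hops, which matches the "up to 4 nodes away" wording of the question rather than the exactly-4-step property path. NetworkX offers comparable functionality, for example `single_source_shortest_path_length` with a `cutoff` argument.

```python
from collections import deque


def neighbours_within(links, start, max_hops):
    """Return {node: shortest hop distance} for nodes 1..max_hops from start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the cutoff
        for neighbour in links.get(node, ()):
            if neighbour not in distance:  # each node is visited only once
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    del distance[start]  # the start article is not its own neighbour
    return distance


# Hypothetical adjacency data (article -> articles it links to).
links = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
    "E": ["F"],
    "F": ["G"],
    "G": [],
}

reachable = neighbours_within(links, "A", 4)
print(len(reachable))  # 5: B, C, D, E and F; G is 5 hops away
```

Because every node is expanded at most once, the work grows with the number of links actually reached rather than with the number of distinct paths.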

