SPARQL query to return the number of neighbors

I need to find just the number of neighbors (up to 4 nodes away) of a given article in DBpedia (two articles are neighbors when there is a wikilink between them). I am currently running this query, but it takes a long time to compute:

PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT (COUNT(?n4) AS ?count)
WHERE {
    SELECT DISTINCT ?n4
    WHERE {
        <http://dbpedia.org/resource/Albert_Einstein> dbo:wikiPageWikiLink/dbo:wikiPageWikiLink/dbo:wikiPageWikiLink/dbo:wikiPageWikiLink ?n4 .
    }
}

Does anyone have an idea for a better way to do this? I only need the number of neighbors. The query is only fast up to degree 2; at degree 3 it takes almost 30 seconds to complete, and at degree 4 it almost always times out.

I'm using RDFLib and Python to do the query, so any trick with Python would also be helpful!

EDIT: I have already downloaded the dataset and set up a local endpoint for the query, but performance is still poor.

If you are going to do lots of repeated queries for neighbors that are 4 steps away, you could put all the computational effort into a single, one-time equivalent property calculation:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX ex: <http://example.com/>

CONSTRUCT {
  ?x ex:fourthNeighbour ?y .
}
WHERE {
  ?x dbo:wikiPageWikiLink/dbo:wikiPageWikiLink/dbo:wikiPageWikiLink/dbo:wikiPageWikiLink ?y .
}

This will still take a long time to run; however, you will only need to do it once. After the constructed triples are loaded into your store, any queries for 4-step neighbours will be much faster.
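Since the question also asks for Python tricks, the same "compute once, look up many times" idea can be sketched outside SPARQL. This is only an illustration under stated assumptions: the links are assumed to be already loaded into a plain dict mapping each article to its outgoing wikilinks, and the names `nodes_after_k_steps`, `materialize_fourth_neighbours`, and `toy_links` are hypothetical. Like the property path, it returns the endpoints of walks of exactly 4 steps, with duplicates removed.

```python
def nodes_after_k_steps(links, start, k=4):
    """Return the set of nodes reached from `start` by exactly k link steps."""
    frontier = {start}
    for _ in range(k):
        # Follow one outgoing link from every node in the current frontier.
        # Building a set drops duplicates at each hop, like DISTINCT would.
        frontier = {target for node in frontier for target in links.get(node, ())}
    return frontier


def materialize_fourth_neighbours(links, k=4):
    """One-time precomputation: map every source node to its k-step set."""
    return {node: nodes_after_k_steps(links, node, k) for node in links}


# Hypothetical toy link graph standing in for the DBpedia wikilinks.
toy_links = {
    "A": ["B", "C"],
    "B": ["C", "D"],
    "C": ["D"],
    "D": ["E"],
    "E": ["A"],
}

fourth = materialize_fourth_neighbours(toy_links)
print(len(fourth["A"]))  # number of distinct 4-step neighbours of "A"
```

After the precomputation, each count is a dictionary lookup. Whether the full table fits in memory for all of DBpedia is a separate question that this sketch does not answer.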

SPARQL 1.1 property paths can have very high time and space complexity; see the paper "Counting Beyond a Yottabyte, or how SPARQL 1.1 Property Paths will Prevent Adoption of the Standard".

Your query has a worst-case complexity of O(n^4), where n is the number of articles in DBpedia, which is a lot. The actual runtime depends on the network structure of the data. Imagine John has 100 friends, each of whom has 100 friends of their own, and so on: the friends of degree 4 can then number up to (including duplicates) 100^4 = 10^8 = 100 million.
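The gap between "paths including duplicates" and "distinct neighbours" can be made concrete with a small sketch. It assumes a hypothetical toy graph of 5 nodes in which every node links to the 4 others; `count_walks` and `complete` are illustrative names, not part of any library.

```python
def count_walks(links, start, k):
    """Return (number of k-step walks from start, number of distinct endpoints)."""
    ways = {start: 1}  # endpoint -> number of walks that end there
    for _ in range(k):
        next_ways = {}
        for node, count in ways.items():
            for target in links.get(node, ()):
                # Every walk ending at `node` extends to one walk ending at `target`.
                next_ways[target] = next_ways.get(target, 0) + count
        ways = next_ways
    return sum(ways.values()), len(ways)


# Hypothetical toy graph: 5 nodes, each linking to the other 4.
complete = {i: [j for j in range(5) if j != i] for i in range(5)}

walks, distinct = count_walks(complete, 0, 4)
print(walks, distinct)   # 4**4 = 256 walks, but only 5 distinct endpoints
print(100**4 == 10**8)   # the arithmetic from the example above
```

An engine that enumerates every path before applying DISTINCT does work proportional to the first number, even though the answer is bounded by the second.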

Additionally, in my testing RDFLib performed very poorly compared to a dedicated triple store such as Virtuoso Open Source 7.

However, if even that is not enough, you could try dedicated graph theory tools and libraries such as NetworkX, Gephi, and Cytoscape. While RDF is also a graph data model, triple stores may not be optimized for that kind of query.
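As a rough idea of what such a graph-oriented approach looks like, here is a minimal breadth-first search with a depth limit, written in plain Python so it runs without extra packages. It assumes the wikilinks have been exported into a dict of outgoing links; `count_within` and `chain` are hypothetical names. Unlike the property path query, it counts nodes at distance 1 to 4 ("up to 4 nodes away"), follows outgoing links only, and visits each node at most once.

```python
from collections import deque


def count_within(links, start, max_depth=4):
    """Count distinct nodes reachable from `start` in at most max_depth steps."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for target in links.get(node, ()):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return len(seen) - 1  # exclude the start node itself


# Hypothetical toy chain: A -> B -> C -> D -> E -> F
chain = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["E"], "E": ["F"], "F": []}

print(count_within(chain, "A", 4))  # F is 5 steps away, so it is not counted
```

Because every node and link is processed at most once, the cost grows with the size of the reachable subgraph rather than with the number of paths. NetworkX provides a comparable routine, `single_source_shortest_path_length` with a `cutoff` argument, if you prefer a library over hand-written code.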

