SPARQL: querying all triples linked to one URL, but not to another

From one RDF file containing statements about multiple chemical compounds, I would like to create one RDF file per chemical compound.

To do that, I looked for a SPARQL query that can isolate all triples linked to a certain URL, no matter how many nodes are in between.

I began with a SPARQL query from https://stackoverflow.com/a/33290642/5433896 , hoping it would return all triples linked to the chemical compound :d1 in my dataset, but none about another compound, :d10 :

CONSTRUCT {
   :d1 ?prop ?val .
   ?child ?childProp ?childPropVal . 
   ?someSubj ?incomingChildProp ?child .
}
WHERE {
     :d1 ?prop ?val ;
         (:overrides|!:overrides)+ ?child . 
     ?child ?childProp ?childPropVal.
     ?someSubj ?incomingChildProp ?child. 
}

However, when I ran this on my simplified test case (Python):

rdf = """<?xml version="1.0"?>


<!DOCTYPE rdf:RDF [
    <!ENTITY owl "http://www.w3.org/2002/07/owl#" >
    <!ENTITY owl11 "http://www.w3.org/2006/12/owl11#" >
    <!ENTITY xsd "http://www.w3.org/2001/XMLSchema#" >
    <!ENTITY owl11xml "http://www.w3.org/2006/12/owl11-xml#" >
    <!ENTITY carcinogenesis "http://dl-learner.org/carcinogenesis#" >
    <!ENTITY rdfs "http://www.w3.org/2000/01/rdf-schema#" >
    <!ENTITY rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns#" >
]>


<rdf:RDF xmlns="http://dl-learner.org/carcinogenesis#"
     xml:base="http://dl-learner.org/carcinogenesis"
     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
     xmlns:owl11="http://www.w3.org/2006/12/owl11#"
     xmlns:carcinogenesis="http://dl-learner.org/carcinogenesis#"
     xmlns:owl11xml="http://www.w3.org/2006/12/owl11-xml#"
     xmlns:owl="http://www.w3.org/2002/07/owl#"
     xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <owl:Ontology rdf:about=""/>

    <owl:ObjectProperty rdf:about="#hasAtom">
        <rdfs:domain rdf:resource="#Compound"/>
        <rdfs:range rdf:resource="#Atom"/>
    </owl:ObjectProperty>

    <owl:ObjectProperty rdf:about="#hasBond">
        <rdfs:domain rdf:resource="#Compound"/>
        <rdfs:range rdf:resource="#Bond"/>
    </owl:ObjectProperty>

    <owl:ObjectProperty rdf:about="#hasStructure">
        <rdfs:domain rdf:resource="#Compound"/>
        <rdfs:range rdf:resource="#Structure"/>
    </owl:ObjectProperty>

    <owl:ObjectProperty rdf:about="#inBond">
        <rdfs:range rdf:resource="#Atom"/>
        <rdfs:domain rdf:resource="#Bond"/>
    </owl:ObjectProperty>

    <owl:DatatypeProperty rdf:about="#charge">
        <rdfs:domain rdf:resource="#Atom"/>
        <rdfs:range rdf:resource="&xsd;double"/>
    </owl:DatatypeProperty>

    <owl:DatatypeProperty rdf:about="#isMutagenic">
        <rdfs:domain rdf:resource="#Compound"/>
        <rdfs:range rdf:resource="&xsd;boolean"/>
    </owl:DatatypeProperty>

    <owl:Class rdf:about="#Atom"/>

    <owl:Class rdf:about="#Bond">
        <owl:disjointWith rdf:resource="#Structure"/>
        <owl:disjointWith rdf:resource="#Atom"/>
    </owl:Class>

    <owl:Class rdf:about="#Bond-7">
        <rdfs:subClassOf rdf:resource="#Bond"/>
    </owl:Class>

    <owl:Class rdf:about="#Carbon-22">
        <rdfs:subClassOf rdf:resource="#Carbon"/>
    </owl:Class>

    <owl:Class rdf:about="#Compound">
        <owl:disjointWith rdf:resource="#Structure"/>
        <owl:disjointWith rdf:resource="#Atom"/>
        <owl:disjointWith rdf:resource="#Bond"/>
    </owl:Class>

    <owl:Class rdf:about="#Six_ring">
        <rdfs:subClassOf rdf:resource="#Ring"/>
    </owl:Class>

    <owl:Class rdf:about="#Ring">
        <rdfs:subClassOf rdf:resource="#Structure"/>
    </owl:Class>

    <owl:Class rdf:about="#Structure">
        <owl:disjointWith rdf:resource="#Atom"/>
    </owl:Class>

    <Compound rdf:about="#d1">
        <hasBond rdf:resource="#bond1"/>
        <hasAtom rdf:resource="#d1_2"/>
        <hasAtom rdf:resource="#d1_3"/>
        <hasStructure rdf:resource="#six_ring-1"/>
        <isMutagenic rdf:datatype="&xsd;boolean">false</isMutagenic>
    </Compound>

    <Bond-7 rdf:about="#bond1">
        <inBond rdf:resource="#d1_3"/>
        <inBond rdf:resource="#d1_2"/>
    </Bond-7>

    <Carbon-22 rdf:about="#d1_2">
        <charge rdf:datatype="&xsd;double">-0.133</charge>
    </Carbon-22>

    <Carbon-22 rdf:about="#d1_3">
        <charge rdf:datatype="&xsd;double">-0.0030</charge>
    </Carbon-22>

    <Six_ring rdf:about="#six_ring-1"/>

    <Compound rdf:about="#d10">
        <hasBond rdf:resource="#bond40"/>
        <hasAtom rdf:resource="#d10_12"/>
        <hasAtom rdf:resource="#d10_13"/>
        <isMutagenic rdf:datatype="&xsd;boolean">false</isMutagenic>
        <hasStructure rdf:resource="#six_ring-9"/>
    </Compound>

    <Bond-1 rdf:about="#bond40">
        <inBond rdf:resource="#d10_12"/>
        <inBond rdf:resource="#d10_13"/>
    </Bond-1>

    <Six_ring rdf:about="#six_ring-9"/>

    <Nitrogen-32 rdf:about="#d10_12">
        <charge rdf:datatype="&xsd;double">-0.313</charge>
    </Nitrogen-32>

    <Nitrogen-32 rdf:about="#d10_13">
        <charge rdf:datatype="&xsd;double">-0.313</charge>
    </Nitrogen-32>

</rdf:RDF>
"""

# Inspired by https://stackoverflow.com/a/33290642/5433896:

sparql_query = """CONSTRUCT {
   :d1 ?prop ?val .
   ?child ?childProp ?childPropVal . 
   ?someSubj ?incomingChildProp ?child .
}
WHERE {
     :d1 ?prop ?val ;
         (:overrides|!:overrides)+ ?child . 
     ?child ?childProp ?childPropVal.
     ?someSubj ?incomingChildProp ?child. 
}
"""

# Trying this query out:
import rdflib
import logging
logger = logging.getLogger()
logger.setLevel("INFO")

graph = rdflib.Graph()
graph.parse(data=rdf, format='xml')
result = graph.query(sparql_query)
forbidden_suffixes = ('#d10', '#bond40', '#six_ring-9', '#d10_12', '#d10_13')
for s, p, o in result:
    print(s, p, o)
    # Subjects belonging to compound :d10 must not appear in the results for :d1.
    if s.endswith(forbidden_suffixes):
        logging.error('This triple should not be in the results! => {0} {1} {2}.'.format(s, p, o))

I get two errors that I want to avoid:

ERROR:root:This triple should not be in the results! => http://dl-learner.org/carcinogenesis#six_ring-9 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://dl-learner.org/carcinogenesis#Six_ring.

ERROR:root:This triple should not be in the results! => http://dl-learner.org/carcinogenesis#d10 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://dl-learner.org/carcinogenesis#Compound.

I found the answer to this question while writing it up, by looking more closely at things I had already noticed.

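Since both :d1 rdf:type :Compound and :d10 rdf:type :Compound hold, the class :Compound is reachable from :d1 and is therefore matched as ?child . The pattern ?someSubj ?incomingChildProp ?child then returns every triple that points at :Compound , including :d10 rdf:type :Compound . So a triple about :d10 ends up in my query results, which is of course not what I wanted.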
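The reasoning above can be checked with a small plain-Python model. This is only a sketch: the tuples are a simplified stand-in for the RDF graph, and the helper name `reachable` is made up for illustration. It uses neither rdflib nor SPARQL.

```python
# Triples as (subject, predicate, object) tuples; a simplified stand-in for the graph.
TRIPLES = {
    ("d1", "rdf:type", "Compound"),
    ("d1", "hasAtom", "d1_2"),
    ("d10", "rdf:type", "Compound"),
    ("d10", "hasAtom", "d10_12"),
}


def reachable(start, triples):
    """Nodes reachable from `start` by following subject -> object, one or more steps."""
    seen, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for s, _, o in triples:
            if s == node and o not in seen:
                seen.add(o)
                frontier.append(o)
    return seen


# Plays the role of `:d1 (:overrides|!:overrides)+ ?child`.
children = reachable("d1", TRIPLES)

# Plays the role of `?someSubj ?incomingChildProp ?child`: every triple pointing at a child.
incoming = {t for t in TRIPLES if t[2] in children}

# Triples whose subject is :d10, although :d10 itself is not reachable from :d1.
leaked = {t for t in incoming if t[0] == "d10"}

print(sorted(children))
print(sorted(leaked))
```

In this model :d10 is never a child of :d1 ; its rdf:type triple comes in only through the incoming-triple pattern on the shared class :Compound .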

Looking at the query, I thought it would make sense to add an additional constraint that ?child mustn't be :d10 . And, thinking about the way the poster of https://stackoverflow.com/a/33290642/5433896 explained his query, I should also exclude that ?someSubj or ?childPropVal can be :d10 :

CONSTRUCT {
   :d1 ?prop ?val .
   ?child ?childProp ?childPropVal . 
   ?someSubj ?incomingChildProp ?child .
}
WHERE {
     :d1 ?prop ?val ;
         (:overrides|!:overrides)+ ?child . 
     ?child ?childProp ?childPropVal.
     ?someSubj ?incomingChildProp ?child.
     FILTER (?child != :d10)
     FILTER (?childPropVal != :d10)
     FILTER (?someSubj != :d10)
}

This removed :d10 from my query results. Great!

But this error remains:

ERROR:root:This triple should not be in the results! => http://dl-learner.org/carcinogenesis#six_ring-9 http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://dl-learner.org/carcinogenesis#Six_ring.

I tried removing some of the triples that could plausibly cause this and found that the element <Six_ring rdf:about="#six_ring-9"/> is the reason #six_ring-9 is still in the query results. So, again, rdf:type (implied by the typed element name Six_ring ) is causing the problem.

Ideally, we would need to express in SPARQL that we want to know the rdf:type of each object linked to :d1 (e.g. :Compound , :Six_ring ), but NOT which other objects are also linked to those types. That would solve BOTH problems we initially detected with the query.

So this query solves the issue:

CONSTRUCT {
   :d1 ?prop ?val .
   ?child ?childProp ?childPropVal . 
   ?someSubj ?incomingChildProp ?child .
}
WHERE {
     :d1 ?prop ?val ;
         (:overrides|!:overrides)+ ?child . 
     ?child ?childProp ?childPropVal.
     ?someSubj ?incomingChildProp ?child.
     FILTER (?incomingChildProp != rdf:type)
}
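The idea described above (keep the types of linked resources without pulling in other instances of those types) can also be modelled outside SPARQL as a forward-only traversal, run once per compound. The following is a minimal plain-Python sketch under stated assumptions: triples are plain tuples rather than an rdflib graph, literals are plain strings, and the names `forward_closure` and `per_compound` are hypothetical. It is not equivalent to the CONSTRUCT query, because it never collects triples that point into the subgraph from outside.

```python
from collections import deque

# Simplified stand-in for the dataset: (subject, predicate, object) tuples.
TRIPLES = [
    ("d1", "rdf:type", "Compound"),
    ("d1", "hasAtom", "d1_2"),
    ("d1", "hasStructure", "six_ring-1"),
    ("d1_2", "charge", "-0.133"),
    ("six_ring-1", "rdf:type", "Six_ring"),
    ("d10", "rdf:type", "Compound"),
    ("d10", "hasAtom", "d10_12"),
    ("d10", "hasStructure", "six_ring-9"),
    ("d10_12", "charge", "-0.313"),
    ("six_ring-9", "rdf:type", "Six_ring"),
]


def forward_closure(start, triples):
    """Collect every triple reachable from `start` by following subject -> object only."""
    by_subject = {}
    for triple in triples:
        by_subject.setdefault(triple[0], []).append(triple)

    collected, seen, queue = set(), {start}, deque([start])
    while queue:
        node = queue.popleft()
        for triple in by_subject.get(node, []):
            collected.add(triple)
            obj = triple[2]
            if obj not in seen:  # visit each node once, so cycles terminate
                seen.add(obj)
                queue.append(obj)
    return collected


# One triple set per compound; each set could then be written to its own file.
compounds = [s for s, p, o in TRIPLES if p == "rdf:type" and o == "Compound"]
per_compound = {c: forward_closure(c, TRIPLES) for c in compounds}

for name, subset in per_compound.items():
    print(name, len(subset))
```

Because edges are only followed from subject to object, reaching the class :Compound from :d1 never leads back to :d10 .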
