
Extracting URIs from RDF web page in Java using Jena Library

I have written the following code to extract URIs from a web page with content type application/rdf-xml, for a Linked Data application.

public static void test(String url) {
    try {
        Model read = ModelFactory.createDefaultModel().read(url);
        System.out.println("to go");
        StmtIterator si;
        si = read.listStatements();
        System.out.println("to go");
        while(si.hasNext()) {
            Statement s=si.nextStatement();
            Resource r=s.getSubject();
            Property p=s.getPredicate();
            RDFNode o=s.getObject();
            System.out.println(r.getURI());
            System.out.println(p.getURI());
            System.out.println(o.asResource().getURI());
        }
    }
    catch(JenaException | NoSuchElementException c) {}
}

But for the input

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:ex="http://example.org/stuff/1.0/">
    <rdf:Description rdf:about="http://www.w3.org/TR/rdf-syntax-grammar"
        dc:title="RDF/XML Syntax Specification (Revised)">
        <ex:editor>
            <rdf:Description ex:fullName="Dave Beckett">
                <ex:homePage rdf:resource="http://purl.org/net/dajobe/" />
            </rdf:Description>
        </ex:editor>
    </rdf:Description>
</rdf:RDF>

The output is:

Subject URI is http://www.w3.org/TR/rdf-syntax-grammar
Predicate  URI is http://example.org/stuff/1.0/editor
Object URI is null
Subject URI is http://www.w3.org/TR/rdf-syntax-grammar
Predicate  URI is http://purl.org/dc/elements/1.1/title
Website is read

I need the output to contain all the URIs present on that page, so that I can build a web crawler for RDF pages. Specifically, I need all of the following links in the output:

       http://www.w3.org/TR/rdf-syntax-grammar
       http://example.org/stuff/1.0/editor
       http://purl.org/net/dajobe
       http://example.org/stuff/1.0/fullName
       http://www.w3.org/TR/rdf-syntax-grammar
       http://purl.org/dc/elements/1.1/title

Minor mistake: you mean application/rdf+xml (note the plus).

Anyway, your problem is very simple:

catch(JenaException | NoSuchElementException c) {}

Bad! You're missing the error thrown here, and the output is being truncated:

System.out.println(o.asResource().getURI());

o isn't always a resource, and this will break on the triple

<http://www.w3.org/TR/rdf-syntax-grammar> dc:title "RDF/XML Syntax ..."

so you need to guard against that:

if (o.isResource()) System.out.println(o.asResource().getURI());

or even more specific:

if (o.isURIResource()) System.out.println(o.asResource().getURI());

which will skip the null output you see for ex:editor.
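To get every URI on the page, the same guard has to be applied to the subject and predicate as well, since the subject can be a blank node too. Here is a minimal sketch of that logic which runs on the JDK alone, without Jena: `Node`, `UriNode`, `BlankNode`, `LiteralNode` and `Triple` are hypothetical stand-ins for Jena's node and statement types, the `instanceof UriNode` test plays the role of `isURIResource()`, and the four triples are written out by hand from the sample document rather than parsed.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class UriCollector {
    // Hypothetical stand-ins for Jena's node types; this is not the Jena API.
    static abstract class Node {}

    static final class UriNode extends Node {
        final String uri;
        UriNode(String uri) { this.uri = uri; }
    }

    static final class BlankNode extends Node {}

    static final class LiteralNode extends Node {
        final String text;
        LiteralNode(String text) { this.text = text; }
    }

    static final class Triple {
        final Node subject;
        final UriNode predicate; // predicates are always URIs in RDF
        final Node object;
        Triple(Node subject, UriNode predicate, Node object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    // Records the URI only when the node actually has one (cf. isURIResource()).
    static void addIfUri(Node node, Set<String> out) {
        if (node instanceof UriNode) {
            out.add(((UriNode) node).uri);
        }
    }

    // Collects the distinct URIs from all three positions, in first-seen order.
    static Set<String> collectUris(List<Triple> triples) {
        Set<String> uris = new LinkedHashSet<>();
        for (Triple t : triples) {
            addIfUri(t.subject, uris);
            addIfUri(t.predicate, uris);
            addIfUri(t.object, uris);
        }
        return uris;
    }

    // The sample document from the question, written out by hand as triples.
    static List<Triple> sampleTriples() {
        UriNode spec = new UriNode("http://www.w3.org/TR/rdf-syntax-grammar");
        BlankNode editor = new BlankNode(); // the nested rdf:Description has no rdf:about
        return Arrays.asList(
            new Triple(spec, new UriNode("http://purl.org/dc/elements/1.1/title"),
                       new LiteralNode("RDF/XML Syntax Specification (Revised)")),
            new Triple(spec, new UriNode("http://example.org/stuff/1.0/editor"), editor),
            new Triple(editor, new UriNode("http://example.org/stuff/1.0/fullName"),
                       new LiteralNode("Dave Beckett")),
            new Triple(editor, new UriNode("http://example.org/stuff/1.0/homePage"),
                       new UriNode("http://purl.org/net/dajobe/")));
    }

    public static void main(String[] args) {
        for (String uri : collectUris(sampleTriples())) {
            System.out.println(uri);
        }
    }
}
```

In Jena terms, the body of `addIfUri` would be `if (node.isURIResource()) uris.add(node.asResource().getURI());`, called on `s.getSubject()`, `s.getPredicate()` and `s.getObject()` in turn.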

Now write one thousand times: "I will not swallow exceptions" :-)
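To see why the empty catch block is so harmful, here is a small plain-Java sketch (no Jena involved). The failed cast on a non-String item is only a stand-in for calling asResource() on a literal, and the class and method names are made up for illustration. Both methods stop at the first bad item, but only the second one tells you that something went wrong.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SwallowDemo {
    // Same shape as the question's code: the loop sits inside the try,
    // so the first failure ends the loop, and the empty catch hides it.
    static List<String> swallowing(List<Object> items) {
        List<String> seen = new ArrayList<>();
        try {
            for (Object item : items) {
                seen.add((String) item); // throws ClassCastException on a non-String
            }
        } catch (ClassCastException e) {
            // swallowed: the output is silently truncated
        }
        return seen;
    }

    // Identical loop, but the failure is reported instead of discarded.
    static List<String> reporting(List<Object> items) {
        List<String> seen = new ArrayList<>();
        try {
            for (Object item : items) {
                seen.add((String) item);
            }
        } catch (ClassCastException e) {
            seen.add("ERROR: " + e.getClass().getSimpleName());
        }
        return seen;
    }

    public static void main(String[] args) {
        List<Object> items = Arrays.<Object>asList("a", 42, "b");
        System.out.println(swallowing(items));
        System.out.println(reporting(items));
    }
}
```

In the question's code, even a bare `c.printStackTrace()` in the catch block would have shown which call was failing.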

No, you don't understand what RDF is used for. A crawler is a program designed to retrieve online content and index it. A simple crawler can be fed an HTML document, and it will download (possibly recursively) all the documents mentioned in the href attributes of its <a> elements.

RDF is full of URLs, so you may think it is perfect for feeding a crawler, but unfortunately the URLs in an RDF document are not intended for retrieving other documents. Examples:

Can it be a coincidence? I don't think so. The fact is that RDF is intended to describe the real world; it happens that it can be serialized as XML, but XML is not the only available serialization.

So, what are URLs used for in an RDF document? They are used to name things. How many Johns do you know? Possibly dozens, and thousands more exist. However, if I own the domain example.com, I can use the URL http://example.com/friends/John to refer to my friend named John. RDF can describe the fact that your friend John works at 123, Abc avenue, using two URLs and a string:

"http://me.com/John"   "http://me.com/works_at"   "123, Abc avenue"

This is referred to as a triple, and the URLs contained in it are not meant to point to something retrievable via a TCP socket and a client that understands the HTTP protocol. Note that both your friend (John) and the predicate (works at) are referenced in the triple via a URL. But if you try those URLs in a browser, you'll get nothing.
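As a rough illustration of URLs acting as names, the triple above can be held as plain data. This is only a sketch: the `Triple` class and the `johnWorksAt` and `sameThing` helpers are hypothetical, and `java.net.URI` is used purely as an identifier type. Building and comparing the URIs never opens a network connection.

```java
import java.net.URI;

class TripleDemo {
    // Hypothetical holder for one triple: two names (URIs) and a string value.
    static final class Triple {
        final URI subject;
        final URI predicate;
        final String object;
        Triple(URI subject, URI predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    // The example triple from the text, built without touching the network.
    static Triple johnWorksAt() {
        return new Triple(URI.create("http://me.com/John"),
                          URI.create("http://me.com/works_at"),
                          "123, Abc avenue");
    }

    // Two URIs name the same thing when they are equal; nothing is fetched.
    static boolean sameThing(URI a, URI b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        Triple t = johnWorksAt();
        System.out.println(t.subject + " " + t.predicate + " \"" + t.object + "\"");
        System.out.println(sameThing(t.subject, URI.create("http://me.com/John")));
    }
}
```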

I don't know why you are building your crawler or what it is supposed to do, but RDF is certainly not what you need for the job.
