
Creating a triple-store of links using Scrapy

I'm using Scrapy to crawl websites and build a set of links. I want to create an RDF document from this data set.

My triples will be of the form:

<ParentURL> - <HTML Text associated with Link> - <LinkURL>

Any pointers on how to proceed? Help is appreciated.

<ParentURL> - <HTML Text associated with Link> - <LinkURL>

This would be a difficult representation, since the predicate in an RDF triple can only be a URI; it cannot be a literal or a blank node. As I see it, you've got two easy options.

The first is that you could just use the triples as an opaque data structure and adopt a convention that subject = source URL, predicate = destination URL, and object = link text. That's not quite "RDF-ish", but it would work just fine for you, I think. This has the advantage that you can use a very simple RDF serialization, like N-Triples, which is very easy to generate. The N-Triples syntax is one triple per line: URIs are wrapped in angle brackets, literals are wrapped in double quotes, and each line is terminated by '.'. So if you use this representation, you'd just generate plain text like:

<http://example.org/page1> <http://example.org/page2> "See page 2 for details." .
<http://example.org/page2> <http://example.org/page3> "See page 3 for even more details." .

That's a completely legal N-Triples document. It doesn't get much easier than this.
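As a rough illustration of how little code this takes, here is a minimal Python sketch that formats links as N-Triples lines using the convention above. The function names `escape_literal` and `link_to_ntriple` and the sample data are made up for this example. It assumes the URLs are already absolute and contain no characters that are illegal inside angle brackets (spaces, `<`, `>`, and so on); only the link text is escaped.

```python
def escape_literal(text):
    """Escape characters that may not appear raw inside a quoted N-Triples literal."""
    return (
        text.replace("\\", "\\\\")  # backslash first, so later escapes are not doubled
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def link_to_ntriple(source_url, target_url, link_text):
    """Format one link: subject = source URL, predicate = target URL, object = link text."""
    return '<{}> <{}> "{}" .'.format(source_url, target_url, escape_literal(link_text))


# Hypothetical sample data: (source URL, target URL, link text).
links = [
    ("http://example.org/page1", "http://example.org/page2", "See page 2 for details."),
    ("http://example.org/page2", "http://example.org/page3", 'See "page 3" for more.'),
]

ntriples = "\n".join(link_to_ntriple(*link) for link in links) + "\n"
print(ntriples, end="")
```

Link text scraped from real pages often contains quotes or line breaks, which is why the escaping step is worth keeping even in a quick script.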

The second option would be to use a bit more structure. You'd want to write something like (in Turtle):

@prefix : <http://example.org/your-prefix/> .

<http://example.org/page1> :linksTo [ :hasTargetURL <http://example.org/page2> ; :hasLinkText "see page 2" ] .
<http://example.org/page2> :linksTo [ :hasTargetURL <http://example.org/page3> ; :hasLinkText "see page 3" ] .

That uses three triples for each link instead of one. That said, it's still pretty easy to generate in plain text. It's probably just a matter of whether you want to minimize space (use the first option) or make the graph more semantically sensible (the second option). Some triple stores will optimize queries for subjects and objects more than for predicates, and that would also favor the second option.
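The structured form can also be produced with plain string formatting. The following Python sketch is one possible way to do it, not a definitive implementation: the helper names (`link_to_turtle`, `links_to_turtle`) are invented for this example, the prefix is the placeholder from the Turtle snippet above, and the URLs are assumed to be valid absolute IRIs.

```python
# Placeholder namespace taken from the Turtle example; replace with your own.
PREFIX = "http://example.org/your-prefix/"


def escape_literal(text):
    """Escape characters that may not appear raw inside a short quoted Turtle string."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def link_to_turtle(source_url, target_url, link_text):
    """Describe one link as a blank node carrying its target URL and its text."""
    return '<{}> :linksTo [ :hasTargetURL <{}> ; :hasLinkText "{}" ] .'.format(
        source_url, target_url, escape_literal(link_text)
    )


def links_to_turtle(links):
    """Build a Turtle document: the prefix declaration, a blank line, one statement per link."""
    lines = ["@prefix : <{}> .".format(PREFIX), ""]
    lines.extend(link_to_turtle(*link) for link in links)
    return "\n".join(lines) + "\n"


# Hypothetical sample data: (source URL, target URL, link text).
document = links_to_turtle(
    [
        ("http://example.org/page1", "http://example.org/page2", "see page 2"),
        ("http://example.org/page2", "http://example.org/page3", "see page 3"),
    ]
)
print(document, end="")
```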

As your data is very simple, I'd write a script to convert the JSON or CSV output generated by Scrapy to RDF.
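Such a conversion script could look roughly like the sketch below, which reads CSV rows and emits one N-Triples line per link (source URL as subject, target URL as predicate, link text as object). The column names `parent_url`, `link_text`, and `link_url` are assumptions; they depend on the fields your Scrapy item actually defines. The in-memory sample stands in for a real exported file.

```python
import csv
import io


def escape_literal(text):
    """Escape characters that may not appear raw inside a quoted N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def csv_to_ntriples(csv_file):
    """Yield one N-Triples line per CSV row.

    Assumes the header row names the columns parent_url, link_text and link_url.
    """
    for row in csv.DictReader(csv_file):
        yield '<{}> <{}> "{}" .'.format(
            row["parent_url"], row["link_url"], escape_literal(row["link_text"])
        )


# Stand-in for a CSV file exported by the crawl (hypothetical contents).
sample = io.StringIO(
    "parent_url,link_text,link_url\n"
    "http://example.org/page1,See page 2 for details.,http://example.org/page2\n"
)

ntriples_lines = list(csv_to_ntriples(sample))
for line in ntriples_lines:
    print(line)
```

For a real export, you would pass a file object instead of the sample, for example one opened with `open("links.csv", newline="", encoding="utf-8")`, where `links.csv` is whatever file name you gave the feed output.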

Or you can write an Item Exporter:

http://doc.scrapy.org/en/1.0/topics/exporters.html
