
Creating a triple-store of links using Scrapy

I'm using Scrapy to crawl websites and build a set of links. I want to be able to create an RDF document from this data set.

My triples will be of the form:

<ParentURL> - <HTML Text associated with Link> - <LinkURL>

Any pointers on how to proceed? Help is appreciated.

<ParentURL> - <HTML Text associated with Link> - <LinkURL>

This would be a difficult representation, since the predicate in an RDF triple can only be a URI; it cannot be a literal or a blank node. As I see it, you've got two easy options.

The first is to treat the triples as an opaque data structure and adopt the convention that subject = source URL, predicate = destination URL, and object = link text. That's not quite "RDF-ish", but I think it would work just fine for you. It has the advantage that you can use a very simple RDF serialization, like N-Triples, and generate it very easily. The N-Triples syntax is one triple per line, URIs are wrapped in angle brackets, and each line is terminated by '.'. So if you use this representation, you'd just generate plain text like:

<http://example.org/page1> <http://example.org/page2> "See page 2 for details." .
<http://example.org/page2> <http://example.org/page3> "See page 3 for even more details." .

That's a completely legal N-Triples document. It doesn't get much easier than this.
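Generating that text from crawled links could look roughly like the sketch below. It assumes each link is already available as a (source URL, destination URL, link text) tuple and that the URLs are valid absolute URIs; the helper names `escape_literal` and `to_ntriple` are made up for illustration.

```python
def escape_literal(text):
    """Escape a string for use inside a double-quoted N-Triples literal."""
    return (
        text.replace("\\", "\\\\")  # backslashes first, so later escapes stay intact
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_ntriple(parent_url, link_url, link_text):
    """Format one link as: <source> <destination> "link text" ."""
    return f'<{parent_url}> <{link_url}> "{escape_literal(link_text)}" .'


# Hypothetical sample data standing in for the crawler's output.
links = [
    ("http://example.org/page1", "http://example.org/page2", "See page 2 for details."),
    ("http://example.org/page2", "http://example.org/page3", 'See "page 3" for more.'),
]

ntriples = "\n".join(to_ntriple(*link) for link in links) + "\n"
print(ntriples, end="")
```

Link text scraped from HTML often contains quotes and line breaks, which is why the literal is escaped before it is written out.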

The second option would be to use a bit more structure. You'd want to write something like (in Turtle):

@prefix : <http://example.org/your-prefix/> .

<http://example.org/page1> :linksTo [ :hasTargetURL <http://example.org/page2> ; :hasLinkText "see page 2" ] .
<http://example.org/page2> :linksTo [ :hasTargetURL <http://example.org/page3> ; :hasLinkText "see page 3" ] .

That uses three triples for each link instead of one. That said, it's still pretty easy to generate in plain text. It's probably just a matter of whether you want to minimize space (the first option) or make the graph more semantically sensible (the second option). Some triple stores optimize queries on subjects and objects more than queries on predicates, and that would also favor the second option.
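A minimal sketch of emitting the Turtle form as plain text follows. The prefix is the placeholder namespace from the example above, the property names `:linksTo`, `:hasTargetURL` and `:hasLinkText` are taken from it as well, and the function names are hypothetical.

```python
PREFIX = "http://example.org/your-prefix/"  # placeholder namespace, replace with your own


def escape_literal(text):
    """Escape a string for use inside a double-quoted Turtle literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def link_to_turtle(parent_url, link_url, link_text):
    """One link becomes a blank node carrying the target URL and the link text."""
    return (
        f"<{parent_url}> :linksTo "
        f'[ :hasTargetURL <{link_url}> ; :hasLinkText "{escape_literal(link_text)}" ] .'
    )


def to_turtle(links):
    """Build a Turtle document: prefix declaration, blank line, one statement per link."""
    lines = [f"@prefix : <{PREFIX}> .", ""]
    lines.extend(link_to_turtle(*link) for link in links)
    return "\n".join(lines) + "\n"


print(to_turtle([
    ("http://example.org/page1", "http://example.org/page2", "see page 2"),
    ("http://example.org/page2", "http://example.org/page3", "see page 3"),
]), end="")
```

Note that the `@prefix` declaration must end with a '.', just like any other Turtle statement.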

As your data is very simple, I'd write a script to convert the JSON or CSV output generated by Scrapy to RDF.
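Such a script might look like the sketch below. It assumes the spider was run with a JSON feed (for example `scrapy crawl myspider -o links.json`), which produces a JSON array of items, and that each item has the fields `parent_url`, `link_text` and `link_url`. Those field names, the spider name and the file names are assumptions; adjust them to your own item definition. The output uses the one-triple-per-link N-Triples convention (subject = source URL, predicate = destination URL, object = link text).

```python
import json


def escape_literal(text):
    """Escape a string for use inside a double-quoted N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def items_to_ntriples(items):
    """Yield one N-Triples line per scraped item (field names are assumed)."""
    for item in items:
        yield '<{}> <{}> "{}" .'.format(
            item["parent_url"], item["link_url"], escape_literal(item["link_text"])
        )


def convert(json_path, nt_path):
    """Read a Scrapy JSON feed and write the corresponding N-Triples file."""
    with open(json_path, encoding="utf-8") as src:
        items = json.load(src)
    with open(nt_path, "w", encoding="utf-8") as dst:
        for line in items_to_ntriples(items):
            dst.write(line + "\n")


# In-memory sample standing in for the contents of links.json.
sample = """[
  {"parent_url": "http://example.org/page1",
   "link_text": "See page 2 for details.",
   "link_url": "http://example.org/page2"}
]"""

for line in items_to_ntriples(json.loads(sample)):
    print(line)
```

For a real run you would call something like `convert("links.json", "links.nt")` after the crawl finishes.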

Or you can write an Item Exporter:

http://doc.scrapy.org/en/1.0/topics/exporters.html
