
How to write a SPARQL query that efficiently matches string literals while ignoring case

I am using Jena ARQ to run a SPARQL query against a large ontology stored in Jena TDB, in order to find the types associated with concepts based on their rdfs:label:

SELECT DISTINCT ?type WHERE {
 ?x <http://www.w3.org/2000/01/rdf-schema#label> "aspirin" .
 ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
}

This works pretty well and is actually quite speedy (<1 second). Unfortunately, for some terms, I need to perform this query in a case-insensitive way. For instance, because the label "Tylenol" is in the ontology, but not "tylenol", the following query comes up empty:

SELECT DISTINCT ?type WHERE {
 ?x <http://www.w3.org/2000/01/rdf-schema#label> "tylenol" .
 ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
}

I can write a case-insensitive version of this query using FILTER syntax like so:

SELECT DISTINCT ?type WHERE {
 ?x <http://www.w3.org/2000/01/rdf-schema#label> ?term .
 ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
 FILTER ( regex (str(?term), "tylenol", "i") )
}

But now the query takes over a minute to complete! Is there any way to write the case-insensitive query in a more efficient manner?

Of all the string operators you can use in SPARQL, regex is probably the most expensive one. Your query might run faster if you avoid regex and instead apply UCASE or LCASE to the label and compare it with a search term written in the same case. Something like:

SELECT DISTINCT ?type WHERE {
 ?x <http://www.w3.org/2000/01/rdf-schema#label> ?term .
 ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
 FILTER (lcase(str(?term)) = "tylenol")
}

This might be faster, but in general do not expect great performance for text search with any triple store. Triple stores are very good at graph matching and not so good at string matching.
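If the labels only ever differ in a few predictable ways, one possible workaround (not part of the answer above, and only a sketch) is to enumerate the likely case variants and let the store match each one exactly, so the object stays bound. The Python helper below builds such a query with a SPARQL 1.1 VALUES clause. The function names are made up for illustration, and it covers only the lowercase, uppercase, and capitalized forms: a label such as "TyLenol" would still be missed, as would language-tagged literals.

```python
def case_variants(term):
    """Return the distinct lowercase, uppercase, and capitalized forms of term."""
    seen = []
    for variant in (term.lower(), term.upper(), term.capitalize(), term):
        if variant not in seen:
            seen.append(variant)
    return seen


def escape_literal(text):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_query(term):
    """Build a query that tries each case variant as an exact label match."""
    values = " ".join('"%s"' % escape_literal(v) for v in case_variants(term))
    return (
        "SELECT DISTINCT ?type WHERE {\n"
        " VALUES ?term { " + values + " }\n"
        " ?x <http://www.w3.org/2000/01/rdf-schema#label> ?term .\n"
        " ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .\n"
        "}"
    )


if __name__ == "__main__":
    print(build_query("tylenol"))
```

Whether this is actually faster than the FILTER version depends on how the engine plans the VALUES join, so it is worth timing against your own data.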

The FILTER version of the query runs slower because ?term is unbound: the engine has to scan the PSO or POS index to find every statement with the rdfs:label predicate and then test each one against the regex. When the object was bound to a concrete literal (as in your first example), it could use an OPS or POS index to scan only the statements with the rdfs:label predicate and the specified object, which is a much smaller set.
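The difference between the two access paths can be illustrated with a toy model. This is only a sketch: the triples and the ex: names are invented, and a Python dictionary stands in for the store's sorted on-disk index, which works differently in detail. What it shows is that a bound object needs one index probe, while an unbound object with a filter visits every label statement.

```python
from collections import defaultdict

RDFS_LABEL = "rdfs:label"

# Toy data (hypothetical): (subject, predicate, object) triples.
triples = [
    ("ex:drug1", RDFS_LABEL, "aspirin"),
    ("ex:drug2", RDFS_LABEL, "Tylenol"),
    ("ex:drug3", RDFS_LABEL, "ibuprofen"),
]

# A POS-style index: (predicate, object) -> list of subjects.
pos_index = defaultdict(list)
for s, p, o in triples:
    pos_index[(p, o)].append(s)


def exact_lookup(label):
    """Bound object: a single probe into the index."""
    return pos_index.get((RDFS_LABEL, label), [])


def case_insensitive_scan(label):
    """Unbound object plus filter: every label statement is visited and tested."""
    wanted = label.lower()
    visited = 0
    matches = []
    for s, p, o in triples:
        if p != RDFS_LABEL:
            continue
        visited += 1
        if o.lower() == wanted:
            matches.append(s)
    return matches, visited


if __name__ == "__main__":
    print(exact_lookup("tylenol"))
    print(case_insensitive_scan("tylenol"))
```

With three labels the scan is trivial, but its cost grows with the number of rdfs:label statements in the ontology, whereas the exact lookup does not.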

The common solution to this type of text searching problem is to use an external text index. In this case, Jena provides a free text index called LARQ, which uses Lucene to perform the search and joins the results with the rest of the query.
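The idea behind an external text index can be sketched in a few lines. The example below does not use Lucene or LARQ's actual API: the data, the dictionary-based index, and the function name are all assumptions for illustration. It shows the two steps described above: a case-folded index built ahead of time answers the text search, and the matching subjects are then joined with the rest of the query (here, their types).

```python
from collections import defaultdict

# Toy data (hypothetical): labels and types keyed by subject.
labels = {"ex:drug1": "aspirin", "ex:drug2": "Tylenol"}
types = {"ex:drug1": ["ex:Drug"], "ex:drug2": ["ex:Drug", "ex:Analgesic"]}

# Built once, ahead of query time: lowercased label -> subjects.
text_index = defaultdict(set)
for subject, label in labels.items():
    text_index[label.lower()].add(subject)


def types_for_label(term):
    """Look the term up in the text index, then join with the type data."""
    subjects = text_index.get(term.lower(), set())
    found = set()
    for subject in subjects:
        found.update(types.get(subject, []))
    return sorted(found)


if __name__ == "__main__":
    print(types_for_label("TYLENOL"))
```

The case folding happens once at indexing time rather than once per statement at query time, which is why this approach scales better than a FILTER over all labels.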
