
Query for best match to a string with SPARQL?

I have a list of movie titles and want to look them up in DBpedia to get metadata such as the director. But I have trouble identifying the correct movie with SPARQL, because the titles sometimes don't match exactly.

How can I get the best match for a movie title from DBpedia using SPARQL?

Some problematic examples:

  • My list: "Die Hard: with a Vengeance" vs. DBpedia: "Die Hard with a Vengeance"
  • My list: "Hachi" vs. DBpedia: "Hachi: A Dog's Tale"

My current approach is to query the DBpedia endpoint for all movies, filter by checking that each single token of the title (without punctuation) occurs in the name, order by title, and return the first result. E.g.:

SELECT ?resource ?title ?director WHERE {
   ?resource foaf:name ?title .
   ?resource rdf:type schema:Movie .
   ?resource dbo:director ?director .
   FILTER (
      contains(lcase(str(?title)), "die") && 
      contains(lcase(str(?title)),"hard")
   )
}
ORDER BY (?title)
LIMIT 1
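
For a whole list of titles, the FILTER clause above could be generated rather than written by hand. The following is a minimal Python sketch under that assumption; the helper names title_tokens and build_filter are hypothetical, and it uses every token of the title, whereas the query above only checks "die" and "hard". It only builds the query fragment and does not contact the endpoint.

```python
import re


def title_tokens(title):
    """Lowercase a title and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", title.lower())


def build_filter(title, var="?title"):
    """Build a SPARQL FILTER requiring every token to occur in the given variable."""
    # Tokens consist of word characters only, so they are safe inside the quotes.
    checks = [f'contains(lcase(str({var})), "{tok}")' for tok in title_tokens(title)]
    return "FILTER (\n   " + " &&\n   ".join(checks) + "\n)"


if __name__ == "__main__":
    print(build_filter("Die Hard: with a Vengeance"))
```

Requiring all tokens handles the "Die Hard: with a Vengeance" case, since the colon is dropped before matching, but it would still miss a movie whose DBpedia name lacks one of the words in my list.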

This approach is very slow and also sometimes fails, e.g.:

SELECT ?resource ?title ?director WHERE {
   ?resource foaf:name ?title .
   ?resource rdf:type schema:Movie .
   ?resource dbo:director ?director .
   FILTER (
      contains(lcase(str(?title)), "hachi") 
   )
}
ORDER BY (?title)
LIMIT 10

where the correct result is in second place:

  resource                                          title                        director
  http://dbpedia.org/resource/Chachi_420            "Chachi 420"@en              http://dbpedia.org/resource/Kamal_Haasan
  http://dbpedia.org/resource/Hachi:_A_Dog's_Tale   "Hachi: A Dog's Tale"@en     http://dbpedia.org/resource/Lasse_Hallström    
  http://dbpedia.org/resource/Hachiko_Monogatari    "Hachikō Monogatari"@en      http://dbpedia.org/resource/Seijirō_Kōyama
  http://dbpedia.org/resource/Thachiledathu_Chundan "Thachiledathu Chundan"@en   http://dbpedia.org/resource/Shajoon_Kariyal
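
The wrong first hit comes from contains() doing substring matching: "hachi" is a substring of "Chachi 420", which sorts before "Hachi: A Dog's Tale". A small local sketch of one possible fix, comparing whole word tokens instead of substrings, applied to the four titles from the result above (the function names are hypothetical and this runs on a plain Python list, not on the endpoint):

```python
import re


def title_tokens(title):
    """Lowercase a title and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", title.lower())


def whole_word_matches(query, titles):
    """Keep only titles that contain every query token as a complete word."""
    wanted = set(title_tokens(query))
    return [t for t in titles if wanted <= set(title_tokens(t))]


titles = [
    "Chachi 420",
    "Hachi: A Dog's Tale",
    "Hachikō Monogatari",
    "Thachiledathu Chundan",
]

if __name__ == "__main__":
    # "hachi" is only a complete word in one of the four titles.
    print(whole_word_matches("hachi", titles))
```

Whole-word matching removes the accidental substring hits here, but it does not rank the remaining candidates, so it is a filter rather than a best-match score.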

Any ideas on how to solve this problem? Or even better: how can I query for the best matches to a string with SPARQL in general?

Thanks!

I adapted the regex approach mentioned in the comments and came up with a solution that works pretty well, better than anything I could get with bif:contains:

   SELECT ?resource ?title ?match
          (strlen(str(?title)) AS ?lenTitle)
          (strlen(str(?match)) AS ?lenMatch)

   WHERE {
      ?resource foaf:name ?title .
      ?resource rdf:type schema:Movie .
      ?resource dbo:director ?director .
      # Reduce each title to the search tokens it contains, in order; everything else is dropped
      BIND( replace(LCASE(CONCAT('x',?title)), "^x(die)*(?:.*?(hard))*(?:.*?(with))*.*$", "$1$2$3") AS ?match )
   }

   # Longest token match first; among equal matches, the shortest title wins
   ORDER BY DESC(?lenMatch) ASC(?lenTitle)

   LIMIT 5
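
To make the ranking easier to follow, here is a local Python sketch of the same scoring idea: build the pattern from the search tokens, concatenate the captured tokens, then sort by match length (descending) and title length (ascending). The helper names and the sample title list are assumptions for illustration, not results returned by DBpedia, and Python's regex engine may differ in details from the one the endpoint uses.

```python
import re


def build_pattern(tokens):
    """Mirror the query's regex: the first token is optional right after the
    'x' prefix, each later token is optional anywhere further along, in order."""
    first, *rest = [re.escape(t) for t in tokens]
    body = "".join(f"(?:.*?({t}))*" for t in rest)
    return re.compile(f"^x({first})*{body}.*$")


def match_text(pattern, title):
    """Concatenate the captured tokens, like the "$1$2$3" replacement."""
    m = pattern.match("x" + title.lower())
    return "".join(g or "" for g in m.groups()) if m else ""


def rank(tokens, titles):
    """Sort by longest token match first, then by shortest title."""
    pattern = build_pattern(tokens)
    return sorted(titles, key=lambda t: (-len(match_text(pattern, t)), len(t)))


# Illustrative candidates only, not an actual query result.
sample_titles = [
    "A Good Day to Die Hard",
    "Die Hard",
    "Die Hard 2",
    "Die Hard with a Vengeance",
    "Live Free or Die Hard",
]

if __name__ == "__main__":
    for title in rank(["die", "hard", "with"], sample_titles):
        print(title)
```

One thing the sketch makes visible: because the first token is anchored directly after the 'x' prefix, it only counts when the title starts with it, so "A Good Day to Die Hard" scores only for "hard".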

It's not perfect, so I'm still open to suggestions.


 